FOUNDATIONS OF BEHAVIORAL RESEARCH
THIRD EDITION
Copyright © 1986, 1973, 1964 by Holt, Rinehart and Winston, Inc.
Requests for permission to make copies of any part of the work should be mailed to: Copyrights and Permissions Department, Harcourt Brace Jovanovich, Publishers.

Preface
Some activities command more interest, devotion, and enthusiasm than do others. So it seems to be with science and with art. Why this is so is an interesting and significant psychological question to which there is no unequivocal answer. All that seems to be clear is that once we become immersed in scientific research or artistic expression we devote most of our thoughts, energies, and emotions to these activities. It seems a far cry from science to art. But in one respect at least they are similar: we make passionate commitments to them.¹
This is a book on scientific behavioral research. Above everything else, it aims to
convey the exciting quality of research in general, and in the behavioral sciences and
education in particular. A large portion of the book is focused on abstract conceptual and
technical matters, but behind the discussion is the conviction that research is a deeply
absorbing and vitally interesting business.
It may seem strange in a book on research that I talk about interest, enthusiasm, and passionate commitment. Shouldn't we be objective? Shouldn't we develop a hardheaded attitude toward psychological, sociological, and educational phenomena? Yes, of course.
But more important is somehow to catch the essential quality of the excitement of discov-
ery that comes from research well done. Then the difficulties and frustrations of the
research enterprise, while they never vanish, are much less significant. What I am trying
to say is that strong subjective involvement is a powerful motivator for acquiring an
objective approach to the study of phenomena. It is doubtful that any significant work is
ever done without great personal involvement. It is doubtful that students can learn much
¹ The term "passionate commitment" is Polanyi's. M. Polanyi, Personal Knowledge. Chicago: University of Chicago Press, 1958.
about science, research design, and research methods without considerable personal in-
volvement. Thus I would encourage students to discuss, argue, debate, and even fight
about research. Take a stand. Be opinionated. Later try to soften the opinionation into
intelligent conviction and controlled emotional commitment.
The writing of this book has been strongly influenced by the book's major purpose: to
help students understand the fundamental nature of the scientific approach to problem
solution. Technical and methodological problems have been considered at length. One
cannot understand any complex human activity, especially scientific research activity,
without some technical and methodological competence. But technical competence is
empty without an understanding of the basic intent and nature of scientific research: the
controlled and objective study of the relations among phenomena. All else is subordinate
to this. Thus the book, as its name indicates, strongly emphasizes the fundamentals or
foundations of behavioral research.
To accomplish the major purpose indicated above, the book has four distinctive gen-
eral features. First, it is a treatise on scientific research; it is limited to what is generally
accepted as the scientific approach. It does not discuss historical research, legal research,
library research, philosophical inquiry, and so on.² It emphasizes, in short, understanding
scientific research problem solution.
Second, the student is led to grasp the intimate and often difficult relations between a
research problem and the design and methodology of its solution. While methodological
problems are treated at length, the book is not a "methods" book. Stress is always on the
research problem, the design of research, and the relation between the two. The student is
encouraged to think relationally and structurally.
Third, the content of much of the book is tied together with the notions of set, relation,
and variance. These ideas, together with those of probability theory, statistics, and meas-
urement, are used to integrate the diverse content of research activity into a unified and
coherent whole.
Fourth, a good bit of the book's discussion is slanted toward psychological, sociologi-
cal, and educational research problems. It seemed to me that a foundational research book
was needed in education. But there is little scientific research in education that is uniquely
educational; for the most part it is behavioral research, research basically psychological
and sociological in nature. In sum, while this is a book on the intellectual and technical
foundations of scientific behavioral research in general, it emphasizes psychological,
sociological, and educational problems and examples, while not ignoring other behavioral
disciplines.
The book's content is organized into ten parts. In Part 1, the language and approach of science are studied. Its three chapters discuss the nature of science, scientific problems and hypotheses, and the notions of variables, constructs, and definitions. Part 2 presents
the conceptual and mathematical foundations. Much of the presentation of conceptual and
technical matters, as indicated above, is based on the ideas of set, relation, and variance.
These terms are defined using modern mathematical theory. Fortunately the theory is
simple, though the reader may feel a bit strange at first. After becoming accustomed to the
language and thinking, however, he will find that he possesses powerful instruments for
understanding later subjects.
It is impossible to do competent research or to read and understand research reports …
" Historical inquiry and methodological research are briefly discussed in Appendix A.
… discusses the highly important subject of the principles of analysis and interpretation. This
chapter and the following chapter on the analysis of frequencies were inserted at this point
to make clear the purpose of quantitative analysis and statistics. Indeed, this is what Parts
3, 4, and 5 are about: drawing inferences from data with quantitative methods. Interpretation is in essence drawing inferences, and the ultimate purpose of quantitative analysis and
statistics is interpretation.
Parts 1 through 5, then, provide an important part of the conceptual and mathematical
foundations of behavioral research. The remainder of the book uses these foundations to
attack problems of design, measurement, and observation and data collection.
Part Six, "Designs of Research," is the structural heart of the book. Here the major
designs of experimental and nonexperimental research are outlined and explained. Part
Seven, on types of research, follows naturally from Part Six: nonexperimental research
and the distinctions among laboratory experiments, field experiments, field studies, and
survey research are explored.
Part Eight addresses itself mainly to theoretical measurement problems, while Part
Nine addresses itself to practical and technical problems of gathering the data necessary
for scientific problem solution. Standard methods of observation and data collection (interviews, objective tests and scales, direct observation of behavior, projective methods, content analysis, the use and analysis of available materials, and sociometry) are extensively discussed and illustrated. Q methodology has been given a separate chapter because of its importance and research potential, its distinctive nature, and the evident lack of understanding of its characteristics by behavioral researchers.³
The book ends with fairly extended but elementary discussions of multiple regression,
multivariate analysis, factor analysis, and analysis of covariance structures. The very
nature of behavioral research is multivariable: many variables act in complex ways to
influence other variables. While some of the complexity can be handled with analysis of
variance, it is only with multivariate methods that the complexity of many psychological,
sociological, and educational problems can be adequately attacked. We are in the midst of
a revolution in research thinking and practice. Behavioral research has been changing
from a predominantly univariate emphasis to a multivariate emphasis. The change is
extensive and profound. Even the nature of theory and problems is changing. Before
finishing the book, I hope the reader will become convinced of the necessity of consider-
ing at some length such relatively difficult subjects as multiple regression and factor
analysis.
We need to explain further why Part Ten on multivariate analysis, and especially
Chapter 36 on analysis of covariance structures, seemed necessary. Some experts may say
that the chapter does not belong in an elementary text. Of course some will say this about
the whole of Part Ten on multivariate approaches. I readily grant that many students may profit little from Chapters 33 through 36, especially if their instruction lacks enthusiasm
for multivariate analysis. There can be little doubt, however, of the great importance and
widespread and profitable use of the powerful and fundamental approaches of multiple
regression and factor analysis in psychology, sociology, and education. One cannot conceive of modern behavioral research without also recognizing the necessity for students of
research to study these admittedly difficult yet indispensable approaches to research prob-
lems.
Even the relatively advanced student may have trouble with the mathematics and the
scientific reasoning of analysis of covariance structures, not only because of the intrinsic
difficulty of the mathematics (yet it isn't all that difficult), but also because of the ab-
³ This lack of understanding has again been demonstrated recently when researchers have used Q sorts as though they were normative measures and neglected the original thinking and work of Q's originator, William Stephenson (see Chapter 32).
X • Preface
stractness and generality of the system. I recall vividly my own perplexity when a Dutch … of earlier discussions.
To aid student study and understanding, and to help surmount some of the inherent
difficulties of the subject, several devices have been used. One, many topics have been
discussed at length. If a choice had to be made between repetition and possible lack of
student understanding, material was repeated, though in different words with different
examples. Two, many examples from actual research as well as many hypothetical exam-
ples have been used. The student who reads the book through will have been exposed to
a large number and a wide variety of problems, hypotheses, designs, and data and to many
actual research studies in the social sciences and education.
Three, an important feature of the book is the frequent use of simple numerical exam-
ples in which the numbers are only those between 0 and 9. The fundamental ideas of
statistics, measurement, and design can be conveyed as well with small numbers as with
large numbers, without the additional burden of tedious arithmetic computations. It is
suggested that the reader work through each example at least once. Intelligent handling of
data is indispensable to understanding research design and methodology.
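To make the device concrete, here is a minimal sketch (not from the original text; the scores are hypothetical single-digit numbers) of how the fundamental statistics of such an example can be computed and checked by hand:

```python
# Hypothetical single-digit scores of the kind the book's examples use.
scores = [2, 4, 4, 6, 9]

n = len(scores)
mean = sum(scores) / n                     # (2 + 4 + 4 + 6 + 9) / 5 = 5.0
ss = sum((x - mean) ** 2 for x in scores)  # sum of squared deviations = 28.0
variance = ss / (n - 1)                    # sample variance = 28 / 4 = 7.0

print(mean, ss, variance)                  # 5.0 28.0 7.0
```

Because every figure stays in one digit, the deviations and their squares can be verified mentally, which is exactly the point of the device.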
"If readers experience undue difficulty with the chapter. I suggest omitting it from formal study.
Four, most chapters have study suggestions that include readings as well as problems
designed to help integrate and consolidate the material presented in the chapters. Many of
them arose from practical use with graduate students. Answers to most of the computa-
tional problems have been given immediately after the problems. An answer, if checked
against a supplied answer and found to be correct, reassures students about computational
details. They should not have to waste time wondering about right answers. Understand-
ing the procedures is what is important and not the calculations as such.
How does this edition of Foundations differ from the first and second editions? One,
many of the errors and gaucheries in the earlier editions have been detected and corrected.
Two, most of the research examples have been replaced with more recent and, in many
cases, more interesting studies. Moreover, a greater effort has been made to include more
studies from psychology, sociology, and other behavioral disciplines rather than such a
heavy concentration of educational studies. Three, the examples used have in most cases
been taken from multivariate studies and research problems. Four, the attitude toward
computers and computer use remains highly favorable, but this enthusiasm has been
tempered by skepticism and doubt of the wisdom of unsupervised and uncontrolled use of
computer package programs by graduate students and faculty. These doubts have been
expressed mainly in Appendix B, but they have also influenced the presentation in other
parts of the book.
Although much of the text and some of the organization of the book have been
changed, its basic approach and purpose remain the same: understanding of principles of
research through relatively lengthy explanation and many examples. No one can be com-
pletely satisfied with the organization and content of a book. One of the goals of all three
editions has been to supply enough materials for instructors and students of diverse background and taste to select what they need. Even so, the content of the book is still highly selective when one considers the great diversity and depth of modern behavioral research
and methods. I hope my selection will serve the teaching and learning needs of most
instructors and students.
All books are cooperative enterprises. Though one person may undertake the actual writing, he is dependent on many others for ideas, criticism, and support. Among the
many persons who contributed to this book, I am most indebted to those mentioned
below. I here express my sincere thanks to them.
Three individuals read the entire original manuscript of the first edition and made
many valuable and constructive suggestions for improvement: Professors T. Newcomb,
D. Harris, and J. Nunnally. Professor Newcomb also furnished the early prodding and
encouragement needed to get the book going. Professor Harris contributed from his wide
research experience insights whose worth cannot be weighed. The late Professor Nunnally's trenchant and penetrating analysis was invaluable, especially with a number of
difficult technical matters.
I am grateful to the many teachers and students who have corresponded with me about
aspects of the book (especially the errors). All suggested corrections and changes have
been given careful consideration. I owe a large debt in the writing of both revisions to my
colleague and friend, Professor E. Pedhazur, and to Professor E. Page. They have ferreted
out weaknesses and made many suggestions for improvement. I also want to express my
gratitude to my former colleagues of the Psychology Laboratory, University of Amster-
dam, who pointed out errors and ambiguities in the text, some of which I have been able
to correct.
The editors of Holt, Rinehart and Winston who were responsible for the first two
editions of the book and a substantial part of this edition, Richard Owen and David
Boynton, both contributed greatly to the content and style of the book itself with perspicacious insights and suggestions and with steady and unfailing psychological support. I am
very grateful to them for their work with me and for being conscientious, scrupulous, and
creative editors.
Authors rarely recognize and acknowledge the work and contributions of project editors. Project editors (I prefer to say "super editors") are those individuals who solve the many technical communication problems encountered in preparing book manuscripts
for publication. I would like to mention the extraordinary competence, insight, and aes-
thetic and technical creativity of the individual who has prepared this book for the printer:
Jeanette Ninas Johnson. I am very grateful and here acknowledge her valuable contribu-
tion.
It is doubtful that this book could have been written without the sabbatical leaves
given me in 1961-1962 and 1970-1971 by New York University. I am grateful to the University for its generous sabbatical policy.
The price a family pays for an author's book is high. Its members put up with his
obsession and his unpredictable writing ups and downs. I express my gratitude and indebt-
edness to my wife and sons by dedicating the book to them. I must say more than this,
however. My wife has had to cope with two overseas moves and one transcontinental
move, two retirements, and innumerable logistical and temperamental problems. To ex-
press thanks and gratitude in the face of this extraordinary example of coping seems pale
and inadequate. Nevertheless, I here express both.
Fred N. Kerlinger
Eugene, Oregon
June 1985
Contents
Preface
PART ONE
The Language and Approach of Science
Chapter 1 Science and the Scientific Approach
Science and common sense. Four methods of knowing. The aims of science,
scientific explanation, and theory. Scientific research: A definition. The
scientific approach.
Chapter 5 Relations
Relations as sets of ordered pairs. Determining relations in research. Rules
of correspondence and mapping. Some ways to study relations. Multivariate
relations and regressions. Study suggestions.
PART THREE
Probability, Randomness, and Sampling
Chapter 7 Probability
Sample space, sample points, and events. Determining probabilities with
coins. An experiment with dice. Some formal theory. Compound events and
their probabilities. Independence, mutual exclusiveness, and exhaustiveness.
Conditional probability. Study suggestions.
PART FOUR
Analysis, Interpretation, Statistics, and Inference
Chapter 9 Principles of Analysis and Interpretation
Frequencies and continuous measures. Rules of categorization. Kinds of
statistical analysis. Indices. Social indicators. The interpretation of research
data. Study suggestions.
PART FIVE
Analysis of Variance
Chapter 13 Analysis of Variance: Foundations
Variance breakdown: A simple example. The t-ratio approach. The analysis of variance approach. An example of a statistically significant difference. Calculation of one-way analysis of variance. A research example. Strength of relations: Correlation and the analysis of variance. Broadening the structure: Planned comparisons and post hoc tests. Study suggestions.
PART SIX
Designs of Research
Chapter 17 Research Design: Purpose and Principles
Purposes of research design. Research design as variance control.
Maximization of experimental variance. Control of extraneous variables.
Minimization of error variance. Study suggestions.
PART SEVEN
Types of Research
Chapter 22 Nonexperimental Research
Basic differences between experimental and nonexperimental research. Self-
selection and nonexperimental research. Large-scale nonexperimental
research. Smaller scale nonexperimental research. Testing alternative
hypotheses. Conclusions. Addendum. Study suggestions.
PART EIGHT
Measurement
Chapter 25 Foundations of Measurement
Definition of measurement. Measurement and "reality" isomorphism.
Properties, constructs, and indicants of objects. Levels of measurement and
scaling. Comparisons of scales: Practical considerations and statistics.
PART NINE
Methods of Observation and Data Collection
Introduction
PART TEN
Multivariate Approaches and Analysis
Introduction
Appendixes
Appendix A: Historical and Methodological Research
Appendix B: The Computer and Behavioral Research
Appendix C: Random Numbers and Statistics
Appendix D: The Research Report
Name Index
Subject Index
PART ONE
THE LANGUAGE AND APPROACH OF SCIENCE
Chapter 1
Science and the Scientific Approach
To understand any complex human activity we must grasp the language and approach
of the individuals who pursue it. So it is with understanding science and scientific re-
search. One must know and understand, at least in part, scientific language and the
scientific approach to problem-solving.
One of the most confusing things to the student of science is the special way scientists
use ordinary words. To make matters worse, they invent new words. There are good
reasons for this specialized use of language; they will become evident later. For now,
suffice it to say that we must understand and learn the language of social scientists. When
investigators tell us about their independent and dependent variables, we must know what
they mean. When they tell us that they have randomized their experimental procedures,
we must not only know what they mean — we must understand why they do as they do.
Similarly, the scientist's approach to problems must be clearly understood. It is not so much that this approach is different from the layman's. It is different, of course, but it is not strange and esoteric. Quite the contrary. When understood, it will seem natural and
almost inevitable that the scientist does what he does. Indeed, we will probably wonder
why much more human thinking and problem-solving are not consciously structured along
such lines.
The purpose of Part One of this book is to help the student learn and understand the
language and approach of science and research. In the chapters of this part many of the
basic constructs of the social and educational scientist will be studied. In some cases it
will not be possible to give complete and satisfactory definitions because of lack of
background at this early point in our development. In such cases we shall attempt to
formulate and use reasonably accurate first approximations to later, more satisfactory
definitions. Let us begin our study by considering how the scientist approaches problems
and how this approach differs from what might be called a commonsense approach.
… fanciful explanations of natural and human phenomena. An illness, for instance, may be
thought to be a punishment for sinfulness. An economic depression may be attributed to
Jews. Scientists, on the other hand, systematically build theoretical structures, test them
for internal consistency, and subject aspects of them to empirical test. Furthermore, they
realize that the concepts they use are man-made terms that may or may not exhibit a close
relation to reality.
Second, scientists systematically and empirically test their theories and hypotheses.
Nonscientists test "hypotheses," too, but they test them in what may be called a selective fashion. They often "select" evidence simply because it is consistent with the hypotheses. Take the stereotype: Blacks are musical. If people believe this, they can easily "verify" the belief by noting that many blacks are musicians. Exceptions to the stereotype, the unmusical or tone-deaf black, for example, are not perceived. Sophisticated social scientists, knowing this "selection tendency" to be a common psychological phenomenon, carefully guard their research against their own preconceptions and predilections and against selective support of hypotheses. For one thing, they are not content with
armchair exploration of relations; they must test the relations in the laboratory or in the
¹ A. Whitehead, An Introduction to Mathematics. New York: Holt, Rinehart and Winston, 1911, p. 157.
² J. Conant, Science and Common Sense. New Haven: Yale University Press, 1951, pp. 32-33. A concept is a word that expresses an abstraction formed by generalization from particulars. "Aggression" is a concept, an abstraction that expresses a number of particular actions having the similar characteristic of hurting people or objects. A conceptual scheme is a set of concepts interrelated by hypothetical and theoretical propositions. (See ibid., pp. 25, 47-48.) A construct is a concept with the additional meaning of having been created or appropriated …
field. They are not content, for example, with the presumed relations between methods of teaching and achievement, between intelligence and creativity, between values and ad-
ministrative decisions. They insist upon systematic, controlled, and empirical testing of
these relations.
A third difference lies in the notion of control. In scientific research, control means
several things. For the present, let it mean that the scientist tries systematically to rule out variables that are possible "causes" of the effects under study other than the variables hypothesized to be the "causes." Laymen seldom bother to control systematically their
explanations of observed phenomena. They ordinarily make little effort to control extra-
neous sources of influence. They tend to accept those explanations that are in accord with
their preconceptions and biases. If they believe that slum conditions produce delinquency,
they tend to disregard delinquency in nonslum neighborhoods. The scientist, on the other
hand, seeks out and "controls" delinquency incidence in different kinds of neighbor-
hoods. The difference, of course, is profound.
Another difference between science and common sense is perhaps not so sharp. It was said earlier that the scientist is constantly preoccupied with relations among phenomena.
So is the layman who invokes common sense for his explanations of phenomena. But the
scientist consciously and systematically pursues relations. The layman's preoccupation
with relations is loose, unsystematic, uncontrolled. He often seizes, for example, on the
fortuitous occurrence of two phenomena and immediately links them indissolubly as cause
and effect.
Take the relation tested in a study done many years ago by Hurlock.³ In more recent terminology, this relation may be expressed: Positive reinforcement (reward) produces greater increments of learning than does punishment. The relation is between reinforcement (or reward and punishment) and learning. Educators and parents of the nineteenth
century often assumed that punishment was the more effective agent in learning. Educa-
tors and parents of the present often assume that positive reinforcement (reward) is more
effective. Both may say that their viewpoints are "only common sense." It is obvious,
they may say, that if you reward (or punish) a child, he or she will learn better. The
scientist, on the other hand, while personally espousing one or the other or neither of these
viewpoints, would probably insist on systematic and controlled testing of both (and other)
relations, as Hurlock did.
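A minimal sketch of such a controlled test, with entirely hypothetical gain scores (Hurlock's actual data are not reproduced here), might compare the mean learning gains of a praised group and a reproved group:

```python
# Hypothetical gain scores for two incentive conditions; not Hurlock's data.
from scipy import stats

praise_gains = [5, 7, 6, 8, 7, 9, 6, 7]
reproof_gains = [4, 5, 3, 6, 5, 4, 5, 4]

# An independent-samples t test of the relation between incentive and learning.
t, p = stats.ttest_ind(praise_gains, reproof_gains)
print(f"t = {t:.2f}, p = {p:.4f}")  # a small p casts doubt on "no difference"
```

The point is not the particular test but the logic: the relation is put to systematic empirical trial rather than settled by common sense.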
A final difference between common sense and science lies in different explanations of
observed phenomena. The scientist, when attempting to explain the relations among ob-
served phenomena, carefully rules out what have been called "metaphysical explanations." A metaphysical explanation is simply a proposition that cannot be tested. To say, for example, that people are poor and starving because God wills it, or that it is wrong to be authoritarian, is to talk metaphysically. None of these propositions can be tested; thus they are metaphysical. As such, science
is not concerned with them. This does not mean that scientists would necessarily spurn
such statements, say they are not true, or claim they are meaningless. It simply means that
as scientists they are not concerned with them. In short, science is concerned with things
that can be publicly observed and tested. If propositions or questions do not contain
implications for such public observation and testing, they are not scientific propositions or
questions.
³ E. Hurlock, "An Evaluation of Certain Incentives Used in Schoolwork," Journal of Educational Psychology, 16 (1925), 145-159.
… they know to be true because they hold firmly to it, because they have always known it
to be true. Frequent repetition of such "truths" seems to enhance their validity. People
often cling to their beliefs in the face of clearly conflicting facts. And they will also infer
"new" knowledge from propositions that may be false.
A second method of knowing or fixing belief is the method of authority. This is the
method of established belief. If the Bible says it, it is so. If a noted physicist says there is
a God, it is so. If an idea has the weight of tradition and public sanction behind it, it is
so. As Peirce points out, this method is superior to the method of tenacity, because human
progress, although slow, can be achieved using the method. Actually, life could not go on
without the method of authority. We must take a large body of facts and information on
the basis of authority. Thus, it should not be concluded that the method of authority is
unsound; it is unsound only under certain circumstances.
The a priori method is the third way of knowing or fixing belief. (Cohen and Nagel
call it the method of intuition.) It rests its case for superiority on the assumption that the
propositions accepted by the "a priorist" are self-evident. Note that a priori propositions
"agree with reason" and not necessarily with experience. The idea seems to be that
people, through free communication and intercourse, can reach the truth because their
natural inclinations tend toward truth. The difficulty with this position lies in the expres-
sion "agree with reason." Whose reason? Suppose two honest and well-meaning individ-
uals, using rational processes, reach different conclusions, as they often do. Which one is
right? Is it a matter of taste, as Peirce puts it? If something is self-evident to many people (for instance, that learning hard subjects trains the mind and builds moral character, or that American education is inferior to Russian and European education), does this mean it is so? According to the a priori method, it does: it just "stands to reason."
The fourth method is the method of science. Peirce says:⁴
To satisfy our doubts, . . . therefore, it is necessary that a method should be found by which our beliefs may be determined by nothing human, but by some external permanency, by something upon which our thinking has no effect. . . . The method must be such that the ultimate conclusion of every man shall be the same. Such is the method of science. Its fundamental hypothesis . . . is this: There are real things, whose characters are entirely independent of our opinions about them. . . .⁵
The scientific approach⁶ has a characteristic that no other method of attaining knowl-
edge has: self-correction. There are built-in checks all along the way to scientific knowl-
edge. These checks are so conceived and used that they control and verify scientific
activities and conclusions to the end of attaining dependable knowledge. Even if a hypoth-
esis seems to be supported in an experiment, the scientist will test alternative plausible
hypotheses that, if also supported, may cast doubt on the first hypothesis. Scientists do not
accept statements as true, even though the evidence at first looks promising. They insist
upon testing them. They also insist that any testing procedure be open to public inspec-
tion.
⁴ J. Buchler, ed., Philosophical Writings of Peirce. New York: Dover, 1955, chap. 2. In the ensuing discussion, I am taking some liberties with Peirce's original formulation in an attempt to clarify the ideas and to make them more germane to the present work. For a good discussion of the four methods, see M. Cohen and E. Nagel, An Introduction to Logic and Scientific Method. New York: Harcourt, 1934, pp. 193-196.
⁵ Buchler, op. cit., p. 18.
⁶ This book's position is that there is no one scientific method as such. Rather, there are a number of methods that scientists can and do use, but it can probably be said that there is one scientific approach.
As Peirce says, the checks used in scientific research are anchored as much as possible in reality lying outside the scientist's personal beliefs, perceptions, biases, values, attitudes, and emotions. Perhaps the best single word to express this is "objectivity." Objectivity is agreement among "expert" judges on what is observed or what is to be done or has been done in research.⁷ But, as we shall see later, the scientific approach involves more than this. The point is that more dependable knowledge is attained because science ultimately appeals to evidence: propositions are subjected to empirical test. An objection may be raised: Theory, which scientists use and exalt, comes from people, the scientists themselves. But, as Polanyi points out, "A theory is something other than myself";⁸ thus a theory helps the scientist to attain greater objectivity. In short, scientists systematically and consciously use the self-corrective aspect of the scientific approach.
… efficient.
These notions impede student understanding of science, the activities and thinking of
the scientist, and scientific research in general. In short, they make the student's task
harder than it would otherwise be. Thus they should be cleared away to make room for
more adequate notions.
There are two broad views of science: the static and the dynamic.⁹ The static view, the
view that seems to influence most laymen and students, is that science is an activity that
contributes systematized information to the world. The scientist's job is to discover new
facts and to add them to the already existing body of information. Science is even con-
ceived to be a body of facts. In this view, science is also a way of explaining observed
phenomena. The emphasis, then, is on the present state of knowledge and adding to it and
on the present set of laws, theories, hypotheses, and principles.
⁷ For discussions of objectivity, its meaning, and its controversial character, see F. Kerlinger, Behavioral Research: A Conceptual Approach. New York: Holt, Rinehart and Winston, 1979, pp. 8-13, 262-264.
⁸ M. Polanyi, Personal Knowledge. Chicago: University of Chicago Press, 1958, p. 4.
⁹ Conant, op. cit., pp. 23-27.
The dynamic view, on the other hand, regards science more as an activity, what
scientists do. The present state of knowledge is important, of course. But it is important
mainly because it is a base for further scientific theory and research. This has been called a
heuristic view. The word "heuristic," meaning serving to discover or reveal, now has the
notion of self-discovery connected with it. A heuristic method of teaching, for instance,
emphasizes students' discovering things for themselves. The heuristic view in science
emphasizes theory and interconnected conceptual schemata that are fruitful for further
research. A heuristic emphasis is a discovery emphasis.
It is the heuristic aspect of science that distinguishes it in good part from engineering
and technology. On the basis of a heuristic hunch, the scientist takes a risky leap. As Polanyi says, "It is the plunge by which we gain a foothold at another shore of reality. On such plunges the scientist has to stake bit by bit his entire professional life."¹⁰ Heuristic
may also be called problem-solving, but the emphasis is on imaginative and not routine
problem-solving. The heuristic view in science stresses problem-solving rather than facts
and bodies of information. Alleged established facts and bodies of information are impor-
tant to the heuristic scientist because they help lead to further theory, further discovery,
and further investigation.
Still avoiding a direct definition of science (but certainly implying one), we now look at the function of science. Here we find two distinct views. The practical man, the nonscientist generally, thinks of science as a discipline or activity aimed at improving things, at making progress. Some scientists, too, take this position. The function of science, in this view, is to make discoveries, to learn facts, to advance knowledge in order to improve things. Branches of science that are clearly of this character receive wide and
strong support. Witness the continuing generous support of medical and military research.
The criteria of practicality and "payoff" are preeminent in this view, especially in educational research.¹¹
A very different view of the function of science is well expressed by Braithwaite: "The function of science . . . is to establish general laws covering the behaviors of the empirical events or objects with which the science in question is concerned, and thereby to enable us to connect together our knowledge of the separately known events, and to make reliable predictions of events as yet unknown."¹² The connection between this view of the function of science and the dynamic-heuristic view discussed earlier is obvious, except that an important element is added: the establishment of general laws, or theory, if you will. If we are to understand modern behavioral research and its strengths and weaknesses …
… Rather than try to explain children's methods of solving arithmetic problems, for example, he
seeks general explanations of all kinds of problem-solving. He might call such a general
explanation a theory of problem-solving.
This discussion of the basic aim of science as theory may seem strange to the student,
who has probably been inculcated with the notion that human activities have to pay off in
practical ways. If we said that the aim of science is the betterment of mankind, most
readers would quickly read the words and accept them. But the basic aim of science is not
the betterment of mankind. It is theory. Unfortunately, this sweeping and really complex
statement is not easy to understand. Still, we must try because it is important.¹³
Other aims of science that have been stated are: explanation, understanding, predic-
tion, and control. If we accept theory as the ultimate aim of science, however, explanation
and understanding become subaims of the ultimate aim. This is because of the definition
and nature of theory:
A theory is a set of interrelated constructs (concepts), definitions, and propositions that present
a systematic view of phenomena by specifying relations among variables, with the purpose of
explaining and predicting the phenomena.
This definition says three things. One, a theory is a set of propositions consisting of
defined and interrelated constructs. Two, a theory sets out the interrelations among a set of
variables (constructs), and in so doing, presents a systematic view of the phenomena
described by the variables. Finally, a theory explains phenomena. It does so by specifying
what variables are related to what variables and how they are related, thus enabling the
researcher to predict from certain variables to certain other variables.
One might, for example, have a theory of school failure. One's variables might be
intelligence, verbal and numerical aptitudes, anxiety, social class membership, and
achievement motivation. The phenomenon to be explained, of course, is school failure, or, perhaps more accurately, school achievement. School failure is explained by specified relations between each of the six variables and school failure, or by combinations of the six variables and school failure. The scientist, successfully using this set of constructs, then "understands" school failure. He is able to "explain" and, to some extent at least, "predict" it.
It is obvious that explanation and prediction can be subsumed under theory. The very
… be concerned with explanation and understanding. Only prediction and control are neces-
sary. Proponents of this point of view may say that the adequacy of a theory is its
"See Kerlinger. Behavioral Research: A Conceptual Approach, op. cit.. pp. 15-18, chap. 16.
"Even this statement must be qualified. See R. Nisbett and L. Ross, Human Inference: Strategies and
Shortcomings of Social Judgment. Englewood Cliffs, N.J.; Prentice-Hall, 1980, pp. lOlff.
predictive power. If by using the theory we are able to predict successfully, then the theory is confirmed and this is enough. We need not necessarily look for further underlying explanations. Since we can predict reliably, we can control because control is deducible from prediction.
The prediction view of science has validity. But as far as this book is concerned,
prediction is considered to be an aspect of theory. By its very nature, a theory predicts.
That is, when from the primitive propositions of a theory we deduce more complex ones,
we are in essence "predicting." When we explain observed phenomena, we are always
stating a relation between, say, the class A and the class B. Scientific explanation inheres
in specifying the relations between one class of empirical events and another, under
certain conditions. We say: If A, then B, A and B referring to classes of objects or events.¹⁵ But this is prediction, prediction from A to B. Thus a theoretical explanation
implies prediction. And we come back to the idea that theory is the ultimate aim of
science. All else flows from theory.
There is no intention here to discredit or denigrate research that is not specifically and
consciously theory-oriented. Much valuable social scientific and educational research is
preoccupied with the shorter-range goal of finding specific relations; that is, merely to
discover a relation is part of science. The ultimately most usable and satisfying relations,
however, are those that are the most generalized, those that are tied to other relations in
a theory.
The notion of generality is important. Theories, because they are general, apply to
many phenomena and to many people in many places. A specific relation, of course, is
less widely applicable. If, for example, one finds that test anxiety is related to test perfor-
mance, this finding, though interesting and important, is less widely applicable and less
understood than if one first found the relation in a network of interrelated variables that are
parts of a theory. Modest, limited, and specific research aims, then, are good. Theoretical
research aims are better because, among other reasons, they are more general and more
widely applicable.
… This definition requires little explanation since it is mostly a condensed and formalized
statement of much that was said earlier or that will be said soon. Two points need empha-
sis, however. First, when we say that scientific research is systematic and controlled, we
mean, in effect, that scientific investigation is so ordered that investigators can have
critical confidence in research outcomes. As we shall see later, scientific research obser-
vations are tightly disciplined. Moreover, among the many alternative explanations of a
phenomenon, all but one are systematically ruled out. One can thus have greater confi-
dence that a tested relation is as it is than if one had not controlled the observations and
ruled out alternative possibilities.
Second, scientific investigation is empirical. If the scientist believes something is so,
he must somehow or other put his belief to a test outside himself. Subjective belief, in
other words, must be checked against objective reality. The scientist must always subject
his notions to the court of empirical inquiry and test. He is hypercritical of the results of
his own and others' research. Every scientist writing a research report has other scientists
reading what he writes while he writes it. Though it is easy to err, to exaggerate, to
overgeneralize when writing up one's own work, it is not easy to escape the feeling of
scientific eyes constantly peering over one's shoulder.
Problem-Obstacle-Idea
The scientist will usually experience an obstacle to understanding, a vague unrest about
observed and unobserved phenomena, a curiosity as to why something is as it is. His first
and most important step is to get the idea out in the open, to express the problem in some
reasonably manageable form. Rarely or never will the problem spring full-blown at this
stage. He must struggle with it, try it out, live with it. Dewey says, "There is a troubled, perplexed, trying situation, where the difficulty is, as it were, spread throughout the entire situation, infecting it as a whole."¹⁶ Sooner or later, explicitly or implicitly, he states the problem, even if his expression of it is inchoate and tentative. Here he intellectualizes, as Dewey puts it, "what at first is merely an emotional quality of the whole situation."¹⁷ In some respects, this is the most difficult and most important part of the whole process.
Without some sort of statement of the problem, the scientist can rarely go further and
expect his work to be fruitful.
Hypothesis
After intellectualizing the problem, after turning back on experience for possible solu-
tions, after observing relevant phenomena, the scientist may formulate a hypothesis. A
hypothesis is a conjectural statement, a tentative proposition about the relation between
two or more phenomena or variables. Our scientist will say, "If such-and-such occurs,
then so-and-so results."
Reasoning-Deduction
… quences of the hypothesis he has formulated. Conant, in talking about the rise of modern science, said that the new element added in the seventeenth century was the use of deductive reasoning.¹⁸ Here is where experience, knowledge, and perspicacity are important.
Often the scientist, when deducing the consequences of a hypothesis he has formulated, will arrive at a problem quite different from the one he started with. On the other
hand, he may find that his deductions lead him to believe that the problem cannot be
solved with present technical tools. For example, before modern statistics was developed,
certain behavioral research problems were insoluble. It was difficult, if not impossible, to
test two or three interdependent hypotheses at one time. It was next to impossible to test
the interactive effect of variables. And we now have reason to believe that certain prob-
lems are insoluble unless they are tackled in a multivariate manner. An example of this is
teaching methods and their relation to achievement and other variables. It is likely that
teaching methods, per se, do not differ much if we study only their simple effects.
Teaching methods probably work differently under different conditions, with different
teachers, and with different pupils. It is said that the methods "interact" with the condi-
tions and with the characteristics of teachers and of pupils.
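A minimal numerical sketch (hypothetical cell means, not from an actual study) shows why simple effects can hide an interaction:

```python
import numpy as np

# Rows: teaching method A, method B; columns: pupil type 1, pupil type 2.
means = np.array([[8.0, 4.0],
                  [4.0, 8.0]])

method_means = means.mean(axis=1)     # [6.0, 6.0]: the methods look identical
effect_by_type = means[0] - means[1]  # [4.0, -4.0]: opposite effects by type
print(method_means, effect_by_type)
```

Averaged over pupils, the two methods do not differ at all; within each pupil type they differ sharply, and in opposite directions. This is what it means to say the methods "interact" with the characteristics of pupils.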
An example may help us understand this reasoning-deduction step. Suppose an inves-
tigator becomes intrigued with aggressive behavior. He wonders why people are often
aggressive in situations where aggressiveness may not be appropriate. He has noted that
aggressive behavior seems to occur when people have experienced difficulties of one kind
or another. (Note the vagueness of the problem here.) After thinking for some time,
reading the literature for clues, and making further observations, he formulates a hypothe-
sis: Frustration leads to aggression. He defines "frustration" as prevention from reaching …
Observation-Test-Experiment
It should be clear by now that the observation-test-experiment phase is only part of the
scientific enterprise. If the problem has been well stated, the hypothesis or hypotheses
adequately formulated, and the implications of the hypotheses carefully deduced, this step
is almost automatic — assuming that the investigator is technically competent.
The essence of testing a hypothesis is to test the relation expressed by the hypothesis.
We do not test variables, as such; we test the relation between the variables. Observation, testing, and experimentation are for one large purpose: putting the problem relation to empirical test. To test without knowing at least fairly well what and why one is testing is to blunder. Simply to state a vague problem, like How does Open Education affect
learning? and then to test pupils in schools presumed to differ in "openness," or to ask
What are the effects of cognitive dissonance? and then, after experimental manipulations
to create dissonance, to search for presumed effects can lead only to questionable infor-
mation. Similarly, to say one is going to study attribution processes without really know-
ing why one is doing it or without stating relations between variables is research nonsense.
Another point about testing hypotheses is that we usually do not test hypotheses
directly. As indicated in the previous step on reasoning, we test deduced implications of
hypotheses. Our test hypothesis may be: "Subjects induced to lie will comply more with
later requests than will subjects not induced to lie," which was deduced from a broader
and more general hypothesis: "Increased guilt leads to increased compliance." We do not
test "inducement to lie" nor "comply with requests." We test the relation between them,
in this case the relation between lying (deduced guilt) and compliance with later requests.²⁰
Dewey emphasized that the temporal sequence of reflective thinking or inquiry is not fixed. We can repeat and reemphasize what he says in our own framework. The steps of
the scientific approach are not neatly fixed. The first step is not neatly completed before
the second step begins. Further, we may test before adequately deducing the implications
of the hypothesis. The hypothesis itself may seem to need elaboration or refinement as a
result of deducing implications from it.²¹
Feedback to the problem, the hypotheses, and, finally, the theory of the results of
research is highly important. Learning theorists and researchers, for example, have fre-
quently altered their theories and research as a result of experimental findings.²² Theorists
and researchers have been studying the effects of early environment and training on later
development. Their research has yielded varied evidence converging on this extremely
important theoretical and practical problem.²³ Part of the essential core of scientific re-
search is the constant effort to replicate and check findings, to correct theory on the basis
of empirical evidence, and to find better explanations of natural phenomena. One can
even go so far as to say that science has a cyclic aspect. A researcher finds, say, that A
is related to B in such-and-such a way. He then does more research to determine under
what other conditions A is similarly related to B. Other researchers challenge his theory
and his research, offering explanations and evidence of their own. The original re-
searcher, it is hoped, alters his work in the light of his own and others' evidence. The
process never ends.
Let us summarize the so-called scientific approach to inquiry. First there is doubt, a
barrier, an indeterminate situation crying out to be made determinate. The scientist expe-
riences vague doubts, emotional disturbance, inchoate ideas. He struggles to formulate
the problem, even if inadequately. He studies the literature, scans his own experience and
^"This hypotliesis was taken from an ingenious and interesting study: J. Freedman. S. Wallington, and
E. Bless. "Compliance Without Pressure: The Effect of Guilt." Journal of Personalis and Social Psychology.
7 (1967). 117-124.
-'
Hypotheses and their expression will often be found inadequate when implications are deduced from them.
A frequent difficulty occurs when a hypothesis is so vague that one deduction is as good as another — that is. the
hypothesis may not yield to precise test.
'E. Hilgard and G. Bower. Theories of Learning. 4th ed. Englewood Cliffs. N. J.: Prentice-Hall, 1975.
-'For example. E. Bennett et al.. "Chemical and Anatomical Plasticity of Brain," Science. 146 1964), (
the experience of others. Often he simply has to wait for an inventive leap of the mind.
Maybe it will occur; maybe not. With the problem formulated, with the basic question or
questions properly asked, the rest is much easier. Then the hypothesis is constructed, after
which its empirical implications are deduced. In this process the original problem, and of
course the original hypothesis, may be changed. It may be broadened or narrowed. It may
even be abandoned. Last, but not finally, the relation expressed by the hypothesis is tested
by observation and experimentation. On the basis of the research evidence, the hypothesis
is accepted or rejected. This information is then fed back to the original problem, and the
problem is kept or altered as dictated by the evidence. Dewey pointed out that one phase
of the process may be expanded and be of great importance, another may be skimped, and
there may be fewer or more steps involved. Research is rarely an orderly business any-
way. Indeed, it is much more disorderly than the above discussion may imply. Order and
disorder, however, are not of primary importance. What is much more important is the
controlled rationality of scientific research as a process of reflective inquiry, the interde-
pendent nature of the parts of the process, and the paramount importance of the problem
and its statement.
Study Suggestion
Some of the content of this chapter is highly controversial. The views expressed are accepted by
some thinkers and rejected by others. Readers can enhance understanding of science and its pur-
pose, the relation between science and technology, and basic and applied research by selective
reading of the literature. Such reading can be the basis for class discussions.
Extended treatment of the controversial aspects of science, especially behavioral science, is
given in my book, Behavioral Research: A Conceptual Approach. New York: Holt, Rinehart and Winston, 1979.
Chapter 2
Problems and Hypotheses
Many people think that science is basically a fact-gathering activity. It is not. As Cohen
says:
There is ... no genuine progress in scientific insight through the Baconian method of accu-
mulating empirical facts without hypotheses or anticipation of nature. Without some guiding
idea we do not know what facts to gather ... we cannot determine what is relevant and what
is irrelevant.¹
The scientifically uninformed person often has the idea that the scientist is a highly
objective individual who gathers data without preconceived ideas. Poincare long ago
pointed out how wrong this idea is. He said:
It is often said that experiments should be made without preconceived ideas. That is impossible.
Not only would it make every experiment fruitless, but even if we wished to do so, it could not
be done."
PROBLEMS
It is not always possible for a researcher to formulate his problem simply, clearly, and
completely. He may often have only a rather general, diffuse, even confused notion of the
problem. This is in the nature of the complexity of scientific research. It may even take an
investigator years of exploration, thought, and research before he can clearly say what
questions he has been seeking answers to. Nevertheless, adequate statement of the re-
search problem is one of the most important parts of research. That it may be difficult or
impossible to state a research problem satisfactorily at a given time should not allow us to
lose sight of the ultimate desirability and necessity of doing so.
Bearing this difficulty in mind, a fundamental principle can be stated: If one wants to
solve a problem, one must generally know what the problem is. It can be said that a large
part of the solution lies in knowing what it is one is trying to do. Another part lies in
knowing what a problem is and especially what a scientific problem is.
What is a good problem statement? Although research problems differ greatly and
there is no one "right" way to state a problem, certain characteristics of problems and
problem statements can be learned and used to good advantage. To start, let us take two or
three examples of published research problems and study their characteristics. First, take
the problem of the study by Hurlock mentioned in Chapter 1: What are the effects on pupil performance of different types of incentives?³ Note that the problem is stated in question
form. The simplest way is here the best way. Also note that the problem states a relation
between variables, in this case between the variables incentives and pupil performance
(achievement). ("Variable" will be formally defined in Chapter 3. For now, a variable is
the name of a phenomenon, or a construct, that takes a set of different numerical values.)
A problem, then, is an interrogative sentence or statement that asks: What relation exists between two or more variables? The answer is what is being sought in the research. A problem in most cases will have two or more variables. In the Hurlock example, the problem statement relates incentive to pupil performance. Another problem, studied in an ingenious experiment by Glucksberg and King, is associated with an adage: We remember what we want to remember, and with Freud's concept of repression: Are memory items associated with unpleasant events more readily forgotten than neutral items?⁴ One variable is items associated with unpleasantness, and the other variable is remembering (or forgetting). Still another problem, by Jones and Cook, is quite different: Do attitudes toward blacks influence judgments of the effectiveness of alternative racial social policies?⁵ One variable is attitudes toward blacks and the other is judgments of the effectiveness of social policies.⁶
There are three criteria of good problems and problem statements. One, the problem should express a relation between two or more variables. It asks, in effect, questions like: Is A related to B? How are A and B related to C? How is A related to B under conditions C and D? The exceptions to this dictum occur mostly in taxonomic or methodological research. (See Appendix A and footnote 6.)
³E. Hurlock, "An Evaluation of Certain Incentives Used in Schoolwork," Journal of Educational Psychology, 16 (1925), 145-149. When citing problems and hypotheses from the literature, I have not always used the words of the authors. In fact, the statements of many of the problems are mine and not those of the cited authors. Some authors use only problem statements; some use only hypotheses; others use both.
⁴S. Glucksberg and L. King, "Motivated Forgetting Mediated by Implicit Verbal Chaining: A Laboratory Analog of Repression," Science, 158 (1967), 517-519.
⁵S. Jones and S. Cook, "The Influence of Attitude on Judgments of the Effectiveness of Alternative Social Policies," Journal of Personality and Social Psychology, 32 (1975), 767-773.
⁶Not all research problems clearly have two or more variables. For example, in experimental psychology, the research focus is often on psychological processes like memory and categorization. In her justifiably well-known and influential study of perceptual categories, Rosch in effect asked the question: Are there nonarbitrary ("natural") categories of color and form? (E. Rosch, "Natural Categories," Cognitive Psychology, 4 (1973), 328-350.) Although the relation between two or more variables is not apparent in this problem statement, in the actual research the categories were related to learning. Toward the end of this book we will see that factor analytic research problems also lack the relation form discussed above. In most behavioral research problems, however, the relations among two or more variables are studied, and we will therefore emphasize such relation statements.
Two, the problem should be stated clearly and unambiguously in question form. Instead of saying, for instance, "The problem is . . . ," or "The purpose of this study is . . . ," ask a question. Questions have the virtue of posing problems directly. The purpose of a study is not necessarily the same as the problem of a study. The purpose of the Hurlock study, for instance, was to throw light on the use of incentives in school situations. The problem was the question about the relation between incentives and performance. Again, the simplest way is the best way: ask a question.
The third criterion is often difficult to satisfy. It demands that the problem and the problem statement should be such as to imply possibilities of empirical testing. A problem that does not contain implications for testing its stated relation or relations is not a scientific problem. This means not only that an actual relation is stated, but also that the variables of the relation can somehow be measured. Many interesting and important questions are not scientific questions simply because they are not amenable to testing. Certain philosophic and theological questions, while perhaps important to the individuals who consider them, cannot be tested empirically and are thus of no interest to the scientist as a scientist. The epistemological question, "How do we know?," is such a question. Education has many interesting but nonscientific questions, such as, "Does democratic education improve the learning of youngsters?" "Are group processes good for children?" These questions can be called metaphysical in the sense that they are, at least as stated, beyond empirical testing possibilities. The key difficulties are that some of them are not relations, and most of their constructs are very difficult or impossible to so define that they can be measured.⁷
HYPOTHESES
A hypothesis is a conjectural statement of the relation between two or more variables. Hypotheses are always in declarative sentence form, and they relate, either generally or specifically, variables to variables. There are two criteria for "good" hypotheses and hypothesis statements. They are the same as two of those for problems and problem statements. One, hypotheses are statements about the relations between variables. Two, hypotheses carry clear implications for testing the stated relations. These criteria mean, then, that hypothesis statements contain two or more variables that are measurable or potentially measurable and that they specify how the variables are related.
Let us take three hypotheses from the literature and apply the criteria to them. The first hypothesis seems to defy common sense: Overlearning leads to performance decrement (or, as the authors say: Practice makes imperfect!).⁸ Here a relation is stated between one variable, overlearning, and another variable, performance decrement. Since the two variables are readily defined and measured, implications for testing the hypothesis, too, are readily conceived. The criteria are satisfied. A second hypothesis is related to the first (though formulated many years earlier). It is also unusual because it states a relation in the so-called null form: Practice in a mental function has no effect on the future learning of that function.⁹ The relation is stated clearly: one variable, practice in a mental function
⁷Webb, working from a different point of view, has proposed the following criteria of research problems: knowledge (of the researcher); dissatisfaction (skepticism, going against the tide, etc.); generality (wideness of applicability). Webb's article is doubly valuable because he effectively disposes of irrelevant criteria, such as conformability, cupidity ("payola"), conformity ("Everybody's doing it"). W. Webb, "The Choice of Problem," American Psychologist, 16 (1961), 223-227.
⁸E. Langer and L. Imber, "When Practice Makes Imperfect: Debilitating Effects of Overlearning," Journal of Personality and Social Psychology, 37 (1980), 2014-2024.
⁹A. Gates and G. Taylor, "An Experimental Study of the Nature of Improvement Resulting from Practice in a Mental Function," Journal of Educational Psychology, 16 (1925), 583-592.
¹⁰T. Alper, H. Blane, and B. Adams, "Reactions of Middle and Lower Class Children to Finger Paints as a Function of Class Differences in Child-Training Practices," Journal of Abnormal and Social Psychology, 51 (1955), 439-448.
¹¹F. Kerlinger, "The Attitude Structure of the Individual: A Q-Study of the Educational Attitudes of Professors . . .
which we set up to test the relation between A and B. We let the facts have a chance to establish the probable truth or falsity of the hypothesis.
Three, hypotheses are powerful tools for the advancement of knowledge because they enable scientists to get outside themselves. Though constructed by man, hypotheses exist, can be tested, and can be shown to be probably correct or incorrect apart from man's values and opinions. This is so important that we venture to say that there would be no science in any complete sense without hypotheses.
Just as important as hypotheses are the problems behind the hypotheses. As Dewey has well pointed out, research usually starts with a problem. He says that there is first an indeterminate situation in which ideas are vague, doubts are raised, and the thinker is perplexed.¹³ He further points out that the problem is not enunciated, indeed cannot be enunciated, until one has experienced such an indeterminate situation.
The indeterminacy, however, must ultimately be removed. Though it is true, as stated earlier, that a researcher may often have only a general and diffuse notion of his problem, sooner or later he has to have a fairly clear idea of what the problem is. Though this statement seems self-evident, one of the most difficult things to do is to state one's research problem clearly and completely. In other words, you must know what you are trying to find out. When you finally do know, the problem is a long way toward solution.
VIRTUES OF PROBLEMS AND HYPOTHESES
Problems and hypotheses, then, have important virtues. One, they direct investigation. The relations expressed in the hypotheses tell the investigator, in effect, what to do. Two, problems and hypotheses, because they are ordinarily generalized relational statements, enable the researcher to deduce specific empirical manifestations implied by the problems and hypotheses. We may say, following Allport and Ross: If it is indeed true that people of extrinsic religious orientation (they use religion) are prejudiced, whereas people of intrinsic religious orientation (they live religion) are not, then it follows that churchgoers should be more prejudiced than nonchurchgoers. They should perhaps also have a "jungle" philosophy: general suspicion and distrust of the world.¹⁴
There are important differences between problems and hypotheses. Hypotheses, if
properly stated, can be tested. While a given hypothesis may be too broad to be tested
directly, if it is a "good" hypothesis, then other testable hypotheses can be deduced from
it. Facts or variables are not tested as such. The relations stated by the hypotheses are
"J. Dewey. Logic: The Theory of Inquiry. New York: Holt, Rinehart and Winston, 1938. pp. 105-107.
"G. Allport and J. Ross. "Personal Religious Orientation and Prejudice.'" Journal of Personality and
Social Psychology. 5 (1967), 432-443.
'^D. Katz and R. Kahn. The Social Psychology of Organizations. 2nd ed. New York: Wiley, 1978, pp.
237ff.
"M. Rosenzweig. "Environmental Complexity, Cerebral Change, and Behavior." American Psychologist,
21 (1966). 321-332.
the theory. These deductions will, of course, have to include the repression notion, which includes the construct of the unconscious. Hypotheses can be formulated using these constructs; in order to test the theory, they have to be so formulated. But testing them is another, more difficult matter because of the extreme difficulty of so defining terms such as "repression" and "unconscious" that they can be measured. To the present, no one has succeeded in defining these two constructs without seriously departing from the original Freudian meaning and usage. Hypotheses, then, are important bridges between theory and empirical inquiry.
¹⁷The words "prove" and "disprove" are not to be taken here in their usual literal sense. It should be remembered that a hypothesis is never really proved or disproved. To be more accurate we should probably say something like: The weight of evidence is on the side of the hypothesis, or the weight of the evidence casts doubt on the hypothesis. Braithwaite says: "Thus the empirical evidence of its instance never proves the hypothesis: in suitable cases we may say that it establishes the hypothesis, meaning by this that the evidence makes it reasonable to accept the hypothesis; but it never proves the hypothesis in the sense that the hypothesis is a logical consequence of the evidence." (R. Braithwaite, Scientific Explanation. Cambridge: Cambridge University Press, 1955, p. 14.)
hypothesis. If it were possible to state a relation between the variables, and if it were possible to define the variables so as to permit testing the relation, then we might have a hypothesis. But there is no way to test value questions scientifically.
A quick and relatively easy way to detect value questions and statements is to look for words such as "should," "ought," "better than" (instead of "greater than"), and similar words that indicate cultural or personal judgments or preferences. Value statements, however, are tricky. While a "should" statement is obviously a value statement, certain other kinds of statements are not so obvious. Take the statement: Authoritarian methods of teaching lead to poor learning. Here there is a relation. But the statement fails as a scientific hypothesis because it uses two value expressions or words, "authoritarian methods of teaching" and "poor learning," neither of which can be defined for measurement purposes without deleting the words "authoritarian" and "poor."¹⁸
Other kinds of statements that are not hypotheses or are poor ones are frequently formulated, especially in education. Consider, for instance: The core curriculum is an enriching experience. Another type, too frequent, is the vague generalization: Reading skills can be identified in the second grade; the goal of the unique individual is self-realization; prejudice is related to certain personality traits.
Another common defect of problem statements often occurs in doctoral theses: the listing of methodological points or "problems" as subproblems. These methodological points have two characteristics that make them easy to detect: (1) they are not substantive problems that spring from the basic problem; and (2) they relate to techniques or methods of sampling, measuring, or analyzing. They are usually not in question form, but rather contain the words "test," "determine," "measure," and the like. "To determine the reliability of the instruments used in this research," "To test the significance of the differences between the means," and "To assign pupils at random to the experimental groups" are examples of this mistaken notion of problems and subproblems.
'"An almost classic case of the use of the word "authoritarian" is the statement sometimes heard among
educators: The lecture method is authoritarian. This seems to mean that the speaker does not like the lecture
method and he is telling us that it is bad. Similarly, one of the most effective ways to criticize a teacher is to
say that he is authoritarian.
ism." and the like have, at the present time at least, no adequate empirical referents.'''
Now, it is quite true that we can define "creativity," say, in a limited way by specifying
one or two creativity tests. This may be a legitimate procedure. Still, in so doing, we run
the risk of getting far away from the original term and its meaning. This is particularly true
when we speak of artistic creativity. We are often willing to accept the risk in order to be
able to investigate important problems, of course. Yet terms like "democracy" are almost
hopeless to define. Even when we do define it, we often find we have destroyed the
original meaning of the term.-"
The other extreme is too great specificity. Every student has heard that it is necessary
to narrow problems down to workable size. This is true. But, unfortunately, we can also
narrow the problem out of existence. In general, the more specific the problem or hypoth-
esis the clearer are its testing implications. But triviality may be the price we pay. While
researchers cannot handle problems that are too broad because they tend to be too vague
for adequate research operations, in their zeal to cut the problems down to workable size
or to find a workable problem, they may cut the life out of it. They may make it trivial or
inconsequential. A thesis, for instance, on the simple relation between the speed of read-
ing and size of type, while important and maybe even interesting, is too thin for doctoral
study. Too great specificity is perhaps a worse danger than too great generality. At any
rate, some kind of compromise must be made between generality and specificity. The
ability effectively to make such compromises is a function partly of experience and partly
of critical study of research problems.
" Although many studies of authoritarianism have been done with considerable success, it is doubtful that
we know what authoritarianism in the classroom means. For instance, an action of a teacher that is authoritarian
in one classroom may not be authoritarian in another classroom. The alleged democratic behavior exhibited by
one teacher may even be called authoritarian if exhibited by another teacher. Such elasticity is not the stuff of
science.
-"An outstanding exception to this statement is Bollen's definition and measurement of "democracy." We
willexamine both in subsequent chapters. K. Bollen. "Issues in the Comparative Measurement of Political
Democracy." Aim'iican Saciologiccil Review. 45 (19801. 370-390.
logical, sociological, and educational reality. Although we will talk of one x and one y, especially in the early part of the book, it must be understood that contemporary behavioral research, which used to be almost exclusively univariate in its approach, is rapidly becoming multivariate. (For now, "univariate" means one x and one y. "Univariate," strictly speaking, applies to y. If there is more than one x or more than one y, the word "multivariate" is used, at least in this book.) We will soon encounter multivariate conceptions and problems. And later parts of the book will be especially concerned with a multivariate approach and emphasis.
the hypothesis is confirmed. This is more powerful evidence than simply observing, without prediction, the covarying of x and y. It is more powerful in the betting-game sense discussed earlier. The scientist makes a bet that x leads to y. If, in an experiment, x does lead to y, then he has won the bet. He cannot just enter the game at any point and pick a perhaps fortuitous common occurrence of x and y. Games are not played this way (at least in our culture). He must play according to the rules, and the rules in science are made to minimize error and fallibility. Hypotheses are part of the rules of the game.
Even when hypotheses are not confirmed, they have power. Even when y does not covary with x, knowledge is advanced. Negative findings are sometimes as important as positive ones, since they cut down the total universe of ignorance and sometimes point up fruitful further hypotheses and lines of investigation. But the scientist cannot tell positive from negative evidence unless he uses hypotheses. It is possible to conduct research
Study Suggestions
1. Use the following variable names to write research problems and hypotheses: frustration,
academic achievement, intelligence, verbal ability, race, social class (socioeconomic status), sex.
(a) If people are given more time than necessary to do a task, will they continue to take more time than necessary on subsequent similar tasks?²¹
(b) How does organizational climate affect administrative performance?²²
(c) Is comprehension of text facilitated by constructing meaningful elaborations of the text?²³
(d) Do colleges discriminate against women applicants?²⁴
(e) How does equalization of extrinsic environmental factors and conditions affect the mental performance of school children?²⁵
(f) Are "natural" categories of color and form developed around basic prototypes (basic colors, basic forms) more readily learned than less prototypical categories?²⁶
(g) What is the influence of massive rewards on the reading achievement of potential school dropouts?²⁷
(h) Does extrinsic reward undermine intrinsic motivation?²⁸
3. Eight hypotheses are given below. Discuss possibilities of testing them. Then read two or three of the studies to see how the authors tested them.
(a) The greater the cohesiveness of a group, the greater its influence on its members.²⁹
(b) The greater the state's control of the economic system, the lower the level of democracy of the political system.³⁰
(c) Revolutionary leaders who were successful before and after the success of the revolutionary movement exhibit a low level of conceptual complexity before the revolution and a high level of complexity after its success.³¹
(d) Role conflict is a function of incompatible expectations placed on or held by the individual.³²
(e) Prejudiced (anti-Semitic) subjects, when frustrated, will displace aggression on to individuals not necessarily related to the source of the frustration.³³
²¹E. Aronson and E. Gerard, "Beyond Parkinson's Law: The Effect of Excess Time on Subsequent Performance," Journal of Personality and Social Psychology, 3 (1966), 336-339.
²²N. Frederiksen, O. Jensen, and A. Beaton, Organizational Climates and Administrative Performance. Princeton, N.J.: Educational Testing Service, 1968.
²³M. Doctorow, M. Wittrock, and C. Marks, "Generative Processes in Reading Comprehension," Journal of Educational Psychology, 70 (1978), 109-118.
²⁴E. Walster, T. Cleary, and M. Clifford, "The Effect of Race and Sex on College Admissions," Journal of . . .
(f) Teachers who are perceived by students as dissimilar (to the students) in traits relevant to teaching are more attractive to students than teachers perceived as similar.³⁴
(g) The more continuous and unlagging the provisions of lessons, the greater the task involvement of pupils.³⁵
(h) Vivid information is better remembered than pallid information and is more likely to influence subsequent inference.³⁶
4. Multivariate (for now, more than two variables) problems and hypotheses have become common in behavioral research. To give the student a preliminary feeling for such problems, we here append several of them. Try to imagine how you would do research to study them.
(e) How do air pollution and socioeconomic status affect pulmonary mortality?⁴¹
(f) Do primary candidates' campaign expenditures, regional exposure, and past performance influence voting outcomes in primary elections?⁴²
(g) How do sex-role stereotyping, sexual conservatism, adversarial sexual beliefs, and acceptance of interpersonal violence affect attitudes toward rape and sexual violence?⁴³
(h) How is rank in the U.S. Civil Service related to social class, race, and sex?⁴⁴
(i) How do home conditions, classroom processes, and peer group environment influence science and mathematics achievement and attitudes?⁴⁵
(j) Does stimulus exposure have two effects, one cognitive and one affective, which in turn affect liking, familiarity, and recognition confidence and accuracy?⁴⁶
³⁴J. Grush, G. Clore, and F. Costin, "Dissimilarity and Attraction: When Difference Makes a Difference," . . .
³⁷J. Bachman and P. O'Malley, "Self-Esteem in Young Men: A Longitudinal Analysis of the Impact of Educational and Occupational Attainment," Journal of Personality and Social Psychology, 35 (1977), 365-380.
³⁸P. Cutright, "National Political Development: Measurement and Analysis," American Sociological Review, 27 (1963), 229-245.
³⁹K. Marjoribanks, "Ethnic and Environmental Influences on Mental Abilities," American Journal of Sociology, 78 (1972), 323-337.
⁴¹L. Lave and E. Seskin, "Air Pollution and Human Health," Science, 169 (1970), 723-733.
⁴²J. Grush, "Impact of Candidate Expenditures, Regionality, and Prior Outcomes on the 1976 Democratic Presidential Primaries," Journal of Personality and Social Psychology, 38 (1980), 337-347.
⁴³M. Burt, "Cultural Myths and Supports for Rape," Journal of Personality and Social Psychology, 38 (1980), 217-230.
⁴⁴K. Meier, "Representative Bureaucracy: An Empirical Analysis," American Political Science Review, 69 (1975), 526-542.
⁴⁵J. Keeves, Educational Environment and Student Achievement. Melbourne: Australian Council for Educational Research, 1972.
""R. Zajonc. "Feeling and Thinking: Preferences Need No Inferences," American Psychologist. 35 1980), (
151-175. The last two problems and studies are quite complex because the relations stated are complex. The
other problems and studies, though also complex, have only one phenomenon presumably affected by other
phenomena, whereas the last two problems have several phenomena influencing two or more other phenomena.
Readers should not be discouraged if they find these problems a bit difficult. By the end of the book they should
appear interesting and natural.
Chapter 3
Constructs, Variables, and Definitions
A concept of more interest to readers of this book is "achievement." It is an abstraction formed from the observation of certain behaviors of children. These behaviors are associated with the mastery or "learning" of school tasks — reading words, doing arithmetic problems, drawing pictures, and so on. The various observed behaviors are put together and expressed in a word — "achievement." "Intelligence," "aggressiveness," "conformity," and "honesty" are all concepts used to express varieties of human behavior.
A construct is a concept. It has the added meaning, however, of having been deliberately and consciously invented or adopted for a special scientific purpose. "Intelligence" is a concept, an abstraction from the observation of presumably intelligent and nonintelligent behaviors. But as a scientific construct, "intelligence" means both more and less than it may mean as a concept. It means that scientists consciously and systematically use it in two ways. One, it enters into theoretical schemes and is related in various ways to other constructs. We may say, for example, that school achievement is in part a function of intelligence and motivation. Two, "intelligence" is so defined and specified that it can be observed and measured. We can make observations of the intelligence of children by administering X intelligence test to them, or we can ask teachers to tell us the relative degrees of intelligence of their pupils.
VARIABLES
Scientists somewhat loosely call the constructs or properties they study "variables." Examples of important variables in sociology, psychology, and education are: sex, income, education, social class, organizational productivity, occupational mobility, level of aspiration, verbal aptitude, anxiety, religious affiliation, political preference, political development (of nations), task orientation, anti-Semitism, conformity, recall memory, recognition memory, achievement. It can be said that a variable is a property that takes on different values. Putting it redundantly, a variable is something that varies. While this way of speaking gives us an intuitive notion of what variables are, we need a more general and yet more precise definition.
A variable is a symbol to which numerals or values are assigned. For instance, x is a variable: it is a symbol to which we assign numerical values. The variable x may take on any justifiable set of values — for example, scores on an intelligence test or an attitude scale. In the case of intelligence we assign to x a set of numerical values yielded by the procedure designated in a specified test of intelligence. This set of values ranges from low to high, from, say, 50 to 150.
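To make the definition concrete, here is a minimal sketch in Python (an editorial illustration; the scores are hypothetical): x is a symbol, and a set of numerical values is assigned to it.

# A variable is a symbol to which numerals or values are assigned.
# Hypothetical intelligence-test scores assigned to the variable x.
x = [98, 112, 87, 105, 131, 74, 120]

print(min(x), max(x))  # the low and high of the values x takes on: 74 131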
A variable, x, however, may have only two values. If sex is the construct under study, then x can be assigned 1 and 0, 1 standing for one of the sexes and 0 standing for the other.¹
¹Such dichotomies and polytomies have been called "qualitative variables." The questionable nature of this designation will be discussed later.
. . . intelligence, a continuous variable, has been broken down into high and low intelligence, or into high, medium, and low intelligence. Variables such as anxiety, introversion, and authoritarianism have been treated similarly. While it is not possible to convert a truly dichotomous variable such as sex to a continuous variable, it is always possible to convert a continuous variable to a dichotomy or a polytomy. As we will see later, such conversion can serve a useful conceptual purpose, but is poor practice in the analysis of data because it throws information away.
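A small sketch shows what is lost (hypothetical scores; the cut point of 100 is arbitrary): once a continuous variable is dichotomized, subjects scoring 101 and 131 become indistinguishable.

# Converting a continuous variable to a dichotomy throws information away.
scores = [74, 87, 98, 101, 105, 112, 120, 131]  # hypothetical continuous scores

cut = 100  # arbitrary cut point separating "low" from "high"
dichotomy = ["high" if s >= cut else "low" for s in scores]

print(list(zip(scores, dichotomy)))
# Both 101 and 131 are now simply "high"; their 30-point difference is gone.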
An operational definition gives meaning to a variable by spelling out what the investigator must do to measure it. A well-known, if extreme, example of an operational definition is: Intelligence (anxiety, achievement, and so forth) is scores on X intelligence test, or intelligence is what X intelligence test measures. This definition tells us what to do to measure intelligence. It says nothing about how well intelligence is measured by the specified instrument. (Presumably the adequacy of the test was ascertained prior to the investigator's use of it.) In this usage, an operational definition is an equation where we say, "Let intelligence equal the scores on X test of intelligence." We also seem to be saying, "The meaning of intelligence (in this research) is expressed by the scores on X intelligence test."
²H. Margenau, The Nature of Physical Reality. New York: McGraw-Hill, 1950, chaps. 4, 5, and 12.
There are, in general, two kinds of operational definitions: (1) measured and (2) experimental. The definition given above is more closely tied to measured than to experimental definitions. A measured operational definition describes how a variable will be measured. For example, achievement may be defined by a standardized achievement test, by a teacher-made achievement test, or by grades. Hiller, Fisher, and Kaess, studying effective classroom lecturing, defined vagueness of lecturing by specifying words and phrases that make lectures vague, for example, "a couple," "a few," "sometimes," "all of this," "and things," "not very," "pretty much." Videotapes of actual lectures were analyzed using this "definition" of vagueness and other operationally defined verbal variables, like interest, information, and verbal fluency.⁴ A study may include the variable consideration. It can be defined operationally by listing behaviors of children that are presumably considerate behaviors and then requiring teachers to rate the children on a five-point scale. Such behaviors might be when the children say to each other, "I'm sorry," or "Excuse me," when one child yields a toy to another on request (but not on threat of aggression), or when one child helps another with a school task.
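The Hiller, Fisher, and Kaess definition shows how a measured operational definition can be executed mechanically: once the defining words and phrases are listed, "vagueness" becomes a countable quantity. The sketch below is only illustrative; the phrase list is abbreviated from the examples quoted above, and the simple counting rule is an assumption of this sketch, not the authors' actual scoring procedure.

# A measured operational definition: "vagueness" is the count of listed
# vague phrases in a lecture transcript (illustrative rule, not the study's).
VAGUE_PHRASES = ["a couple", "a few", "sometimes", "all of this",
                 "and things", "not very", "pretty much"]

def vagueness_score(transcript: str) -> int:
    text = transcript.lower()
    return sum(text.count(phrase) for phrase in VAGUE_PHRASES)

lecture = "Sometimes a few of these reactions occur, and things happen pretty much at once."
print(vagueness_score(lecture))  # 4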
An experimental operational definition spells out the details (operations) of the investigator's manipulation of a variable. Reinforcement can be operationally defined by giving the details of how subjects are to be reinforced (rewarded) and not reinforced (not rewarded) for specified behaviors. In the Hurlock study discussed earlier, for example, some children were praised, some blamed, and some ignored. Dollard et al. define frustration as prevention from reaching a goal, or ". . . interference with the occurrence of an instigated goal response at its proper time in the behavior sequence . . ."⁵ This
⁴J. Hiller, G. Fisher, and W. Kaess, "A Computer Investigation of Verbal Characteristics of Effective . . .
⁶J. Freedman, S. Wallington, and E. Bless, "Compliance Without Pressure: The Effect of Guilt," Journal of Personality and Social Psychology, 7 (1967), 117-124.
Some scientists say that such limited operational meanings are the only meanings that "mean" anything, that all other definitions are metaphysical nonsense. They say that discussions of anxiety are metaphysical nonsense, unless adequate operational definitions of anxiety are available and are used. This view is extreme, though it has healthy aspects. To insist that every term we use in scientific discourse be operationally defined would be too narrowing, too restrictive, and, as we shall see, scientifically unsound.⁷
Despite the dangers of extreme operationism, it can be safely said that operationism has been and still is a healthy influence because, as Skinner puts it, "The operational attitude, in spite of its shortcomings, is a good thing in any science but especially in psychology because of the presence there of a vast vocabulary of ancient and nonscientific origin."⁸ When the terms used in education are considered, it is clear that education, too, has a vast vocabulary of ancient and nonscientific terms. Consider these: the whole child, horizontal and vertical enrichment, meeting the needs of the learner, core curriculum, emotional adjustment, and curricular enrichment.
To clarify constitutive and operational definitions — and theory, too — look at Figure 3.1, which has been adapted after Margenau and Torgerson. The diagram is supposed to illustrate a well-developed theory. The single lines represent theoretical connections or relations between constructs. These constructs, labeled with lower-case letters, are defined constitutively; that is, c₄ is defined somehow by c₃, or vice versa. The double lines represent operational definitions. The C constructs are directly linked to observable data; they are indispensable links to empirical reality. But it is important to note that not all constructs in a scientific theory are defined operationally. Indeed, it is a rather thin theory that has all its constructs so defined.
Let us build a "small theory" of underachievement to illustrate these notions. Suppose an investigator believes that underachievement is, in part, a function of pupils'
[Figure 3.1 appears here: constructs linked by single lines (constitutive, theoretical relations) and by double lines marked O.D. (operational definitions) down to the level of observable data, empirical reality.]
⁷For a good discussion of this point, see F. Northrop, The Logic of the Sciences and the Humanities. New York: Macmillan, 1947, chaps. VI and VII. Northrop says, for example, "The importance of operational definitions is that they make verification possible and enrich meaning. They do not, however, exhaust scientific meaning" (p. 130). Margenau makes the same point in his extended discussion of scientific constructs. (See Margenau, op. cit., pp. 232ff.)
⁸B. Skinner, "The Operational Analysis of Psychological Terms." In H. Feigl and M. Brodbeck, eds., Readings in the Philosophy of Science. New York: Appleton, 1953, p. 586.
[Figure 3.2 appears here: the "small theory" of underachievement, with constructs and measures including achievement, an aptitude test, and aptitude observations.]
probably the most common method of measuring psychological (and educational) constructs. The heavy single line between c₁ and C₁ indicates the relatively direct nature of the presumed relation between self-concept and the test. (The double line between C₁ and the level of observation indicates an operational definition, as it did in Figure 3.1.) Similarly, the construct achievement (C₄) is operationally defined as the discrepancy between measured achievement (C₂) and measured aptitude (C₃). In this model the investigator has no direct measure of achievement motivation, no operational definition of it. In another study, naturally, he may specifically hypothesize a relation between achievement and achievement motivation, in which case he will try to define achievement motivation operationally.
A single solid line between concepts — for example, the one between the construct achievement (C₄) and achievement test (C₂) — indicates a relatively well-established relation between postulated achievement and what standard achievement tests measure. The single solid lines between C₁ and C₂ and between C₂ and C₃ indicate obtained relations between the test scores of these measures. (The lines between C₁ and C₂ and between C₂ and C₃ are labeled r for "relation," or "coefficient of correlation.")
The broken single lines indicate postulated relations between constructs that are not
relatively well established. A good example of this is the postulated relation between
self-concept and achievement motivation. One of the aims of science is to make these
broken lines solid lines by bridging the operational definition-measurement gap. In this
case, it is quite conceivable that both self-concept and achievement motivation can be
operationally defined and directly measured.
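In current practice the obtained relations mentioned here are usually correlation coefficients computed from scores. A minimal Python sketch (an editorial illustration; the scores are wholly hypothetical) shows such an obtained relation at the observational level:

from statistics import correlation  # available in Python 3.10 and later

# Hypothetical scores on two operationally defined measures.
figure_drawing = [12, 15, 9, 20, 14, 17, 11]   # C1: self-concept measure
achievement    = [55, 62, 48, 75, 60, 68, 50]  # C2: achievement test

r = correlation(figure_drawing, achievement)   # Pearson r between the measures
print(round(r, 2))  # an obtained relation, the basis for construct inferences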
In essence, this is the way the behavioral scientist operates. He shuttles back and forth between the level of theory-constructs and the level of observation. He does this by operationally defining the variables of his theory that are amenable to such definition and then by estimating the relations between the operationally defined and measured variables. From these estimated relations he makes inferences as to the relations between the constructs. In the above example, he calculates the relation between C₁ (figure-drawing test) and C₂ (achievement test) and, if the relation is established on this observational
TYPES OF VARIABLES
of variation in the independent variable. In predicting from X to Y, we can take any value of X we wish, whereas the value of Y we predict to is "dependent on" the value of X we have selected. The dependent variable is ordinarily the condition we are trying to explain. The most common dependent variable in education, for instance, is achievement or "learning." We want to account for or explain achievement. In doing so we have a large number of possible X's or independent variables to choose from.
When the relation between intelligence and school achievement is studied, intelligence is the independent and achievement the dependent variable. (Is it conceivable that it might be the other way around?) Other independent variables that can be studied in relation to the dependent variable, achievement, are social class, methods of teaching, personality types, types of motivation (reward and punishment), attitudes toward school, class atmosphere, and so on. When the presumed determinants of delinquency are studied, such determinants as slum conditions, broken homes, lack of parental love, and the like, are independent variables and, naturally, delinquency (more accurately, delinquent behavior) is the dependent variable. In the frustration-aggression hypothesis mentioned earlier, frustration is the independent variable and aggression the dependent variable. Sometimes a phenomenon is studied by itself, and either an independent or a dependent variable is implied. This is the case when teacher behaviors and characteristics are studied. The usual implied dependent variable is achievement or child behavior in general. Teacher behavior can of course be a dependent variable.
The relation between an independent variable and a dependent variable can perhaps be more clearly understood if we lay out two axes at right angles to each other, one axis representing the independent variable and the other axis the dependent variable. (When two axes are at right angles to each other, they are called orthogonal axes.) Following mathematical custom, X, the independent variable, is the horizontal axis and Y, the dependent variable, the vertical axis. (X is called the abscissa and Y the ordinate.) X values are laid out on the X axis and Y values on the Y axis. A very common and useful way to "see" and interpret a relation is to plot the pairs of XY values, using the X and Y axes as a frame of reference. Let us suppose, in a study of child development, that we have two sets of measures: the X measures chronological age, the Y measures reading age:
[A table of paired X (chronological age) and Y (reading age) values appears here.]
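In place of the printed table, hypothetical paired values can illustrate the plotting procedure. The sketch below assumes the matplotlib library is available; the ages are invented for illustration.

import matplotlib.pyplot as plt

# Hypothetical paired measures: X = chronological age, Y = reading age.
x = [5, 6, 7, 8, 9, 10, 11, 12]
y = [4.8, 5.9, 7.1, 7.8, 9.2, 9.9, 11.3, 12.1]

plt.scatter(x, y)                        # plot the pairs of XY values
plt.xlabel("Chronological Age (X)")      # the abscissa: independent variable
plt.ylabel("Reading Age (Y)")            # the ordinate: dependent variable
plt.show()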
positively reinforces a certain kind of behavior — and does something different to another group, or has the two groups follow different instructions, this is manipulation. When one uses different methods of teaching, or rewards the subjects of one group and punishes those of another, or creates anxiety through worrisome instructions, one is actively manipulating the variables methods, reinforcement, and anxiety.
Variables that cannot be manipulated are attribute variables. It is impossible, or at least difficult, to manipulate many variables. All variables that are human characteristics — intelligence, aptitude, sex, socioeconomic status, conservatism, field dependence, need for achievement, and attitudes, for example — are attribute variables. Subjects come to our studies with these variables (attributes) ready-made. Early environment, heredity, and other circumstances have made individuals what they are.¹⁰ The word "attribute," moreover, is accurate enough when used with inanimate objects or referents. Organizations, institutions, groups, populations, homes, and geographical areas have attributes. Organizations are variably productive; institutions become outmoded; groups differ in cohesiveness; geographical areas vary widely in resources.
This active-attribute distinction is general, flexible, and useful. We will see that some variables are by their very nature always attributes, but other variables that are attributes can also be active. This latter characteristic makes it possible to investigate the "same" relations in different ways. A good example, again, is the variable anxiety. We can measure the anxiety of subjects. Anxiety is in this case obviously an attribute variable. But we can manipulate anxiety, too. We can induce different degrees of anxiety, for example, by telling the subjects of one experimental group that the task they are about to do is difficult, that their intelligence is being measured, and that their futures depend on the scores they get. The subjects of another experimental group are told to do their best but to relax; the outcome is not too important and will have no influence on their futures. Actually, we cannot assume that the measured (attribute) and the manipulated (active) "anxieties" are the same. We may assume that both are "anxiety" in a broad sense, but they are certainly not the same.
'"Such variables are also called organismic variables. Any property of an individual, any characteristic or
attribute, is anorganismic variable. It is part of the organism, so to speak. In other words, organismic variables
are those characteristics that individuals have in varying degrees when they come to the research situation. The
term individual differences implies organismic variables.
Another related classification, used mainly by psychologists, is slimulus and response variables. A stimulus
variable is any condition or manipulation by the experimenter of the environment that evokes a response in an
organism. A response variable is any kind of behavior of the organism. The assumption is made that for any
kind of behavior there is always a stimulus. Thus the organism's behavior is a response. This classification is
reflected in the well-known equation: R = fiO.S). which is read: "Responses are a function of the organism and
stimuli." or "Response variables are a function of organismic variables and stimulus variables."
scales in use in the behavioral sciences also have a third characteristic: there is a theoretically infinite set of values within the range. (Rank-order scales are somewhat different; they will be discussed later in the book.) That is, a particular individual's score may be 4.72 rather than simply 4 or 5.
Categorical variables, as I will call them, belong to a kind of measurement called nominal. (It will be explained in Chapter 25.) In nominal measurement, there are two or more subsets of the set of objects being measured. Individuals are categorized by their possession of the characteristic that defines any subset. "To categorize" means to assign an object to a subclass or subset of a class or set on the basis of the object's having or not having the characteristic that defines the subset. The individual being categorized either has the defining property or does not have it; it is an all-or-none kind of thing. The simplest examples are dichotomous categorical variables: sex, Republican-Democrat, white-black. Polytomies, variables with more than two subsets or partitions, are fairly common, especially in sociology and economics: religious preference, education (usually), nationality, occupational choice, and so on.
Categorical variables — and nominal measurement — have simple requirements: all the members of a subset are considered the same and all are assigned the same name (nominal) and the same numeral. If the variable is religious preference, for instance, all Protestants are the same, all Catholics are the same, and all "others" are the same. If an individual is a Catholic — operationally defined in a suitable way — he is assigned to the category "Catholic" and also assigned a "1" in that category. In brief, he is counted as a "Catholic." Categorical variables are "democratic": there is no rank order or greater-than and less-than among the categories, and all members of a category have the same value: 1.
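A short sketch (hypothetical data) makes the nominal rule concrete: every member of a subset receives the same name and counts as a 1 in its own category.

# Nominal measurement: all members of a subset get the same value (1)
# in their own category; there is no rank order among categories.
people = ["Protestant", "Catholic", "Catholic", "Other", "Protestant"]

counts = {}
for preference in people:
    counts[preference] = counts.get(preference, 0) + 1  # each person counts as a "1"

print(counts)  # {'Protestant': 2, 'Catholic': 2, 'Other': 1}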
individuals. If an individual, say, is a Catholic, then put him in the Catholic subset and assign him a 1. It is extremely important to understand this because, for one thing, it is the basis of quantifying many variables — even experimental treatments — for complex analysis. In multiple regression analysis, as we will see later in the book, all variables, continuous and categorical, are entered as variables into the analysis. Earlier, the example of sex was given, 1 being assigned to one sex and 0 to the other. We can set up a column of 1's and 0's just as we would set up a column of dependency scores. The column of 1's and 0's is the quantification of the variable sex. There is no mystery here. The method is
¹¹Such variables have been called "dummy variables." Since they are highly useful and powerful, even indispensable, in modern research data analysis, they should be clearly understood. See F. Kerlinger and E. Pedhazur, Multiple Regression Analysis in Behavioral Research. New York: Holt, Rinehart and Winston, 1973, chaps. 6 and 7, and Chapter 34 of this volume. A polytomy is a division of the members of a group into three or more subdivisions.
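The quantification described here, and the "dummy variables" of footnote 11, can be sketched in a few lines of Python (hypothetical subjects; the coding choices are arbitrary):

# Dummy coding: a categorical variable becomes columns of 1's and 0's.
sex = ["F", "M", "F", "F", "M"]  # hypothetical subjects

# One column suffices for a dichotomy: 1 for one sex, 0 for the other.
sex_dummy = [1 if s == "F" else 0 for s in sex]
print(sex_dummy)  # [1, 0, 1, 1, 0] -- the quantification of the variable sex

# A polytomy needs one column per category (less one, in regression practice).
religion = ["Protestant", "Catholic", "Other", "Catholic"]
for category in ["Protestant", "Catholic"]:
    column = [1 if r == category else 0 for r in religion]
    print(category, column)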
defined, are observables. The distinction is important, because if we are not always keenly aware of the level of discourse we are on when talking about variables, we can hardly be clear about what we are doing.
An important and fruitful expression, which we will encounter and use a good deal later in the book, is "latent variable." A latent variable is an unobserved "entity" presumed to underlie observed variables. The best-known example of an important latent variable is "intelligence." We note, say, that three ability tests, verbal, numerical, and spatial, are positively and substantially related. This means, in general, that persons high on one tend to be high on the others; similarly, persons low on one tend to be low on the others. We believe that something is common to the three tests or observed variables and name this something "intelligence." It is a latent variable.
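The inference can be illustrated numerically. With the hypothetical scores below, the three tests are substantially intercorrelated, which is the pattern that leads us to posit the latent variable:

from statistics import correlation  # available in Python 3.10 and later

# Hypothetical scores of seven persons on three ability tests.
verbal    = [110, 95, 120, 88, 105, 130, 99]
numerical = [108, 97, 118, 90, 101, 126, 95]
spatial   = [105, 92, 115, 93, 104, 124, 98]

# Persons high on one test tend to be high on the others.
print(round(correlation(verbal, numerical), 2))
print(round(correlation(verbal, spatial), 2))
print(round(correlation(numerical, spatial), 2))
# Substantial positive r's suggest a common latent variable: "intelligence."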
We have encountered many examples of latent variables in previous pages: achieve-
ment, creativity, social class, anti-Semitism, conformity, and so on. Indeed, whenever we
utter the names of phenomena on which people or objects vary, we are talking about latent
variables. In science our real interest is more in the relations among latent variables than it
is in the relations among observed variables because we seek to explain phenomena and
their relations. When we enunciate a theory, we enunciate in part systematic relations
among latent variables. We are not too interested in the relation between observed frus-
trated behaviors and observed aggressive behaviors, for example, though we must of
course work with them at the empirical level. We are really interested in the relation
between the latent variable frustration and the latent variable aggression.
We must be cautious, however, when dealing with nonobservables. Scientists, using such terms as "hostility," "anxiety," and "learning," are aware that they are talking about invented constructs the "reality" of which has been inferred from behavior. If they want to study the effects of different kinds of motivation, they must know that "motivation" is a latent variable, a construct invented to account for presumably "motivated" behavior. They must know that its "reality" is only a postulated reality. They can only judge that youngsters are motivated or not motivated by observing their behavior. Still, in order to study motivation, they must measure it or manipulate it. But they cannot
'""E. Tolman. Behavior and Psychological Man. Berkeley. Calif.: University of California Press. 1958. pp.
115-129.
words, "latent variable" can be used with psychological, sociological, and other phe-
nomena. "Latent variable" seems to me to be the preferable term because of its generality
and because it is now possible in the analysis of covariance structures approach to assess
on each other and on so-called manifest or observed varia-
the effects of latent variables
bles.This rather abstract discussion will later be made more concrete and, it is hoped,
meaningful. We will then see that the idea of latent variables and the relations between
them is an e.\tremely important, fruitful, and useful one that is helping to change funda-
mental approaches to research problems.
Social Class"... two or more orders of people who are believed to be, and are
accordingly ranked by the members of a community, in socially superior and inferior
positions."'^ (M) To be operational, this definition has to be specified by questions aimed
at people's beliefs about other people's positions. This is a subjective definition of social
class. Social class, or social status, is also defined more objectively by using such indices
as occupation, income, and education, or by combinations of such indices. For example,
"... primary emphasis has been placed on the educational attainment and occupational
status of the head of the family in which an individual is reared.""* (M)
Achievement (School, Arithmetic, Spelling) Achievement is customarily defined operationally by citing a standardized test of achievement (for example, Iowa Tests of Basic Skills, Elementary), by grade-point averages, or by teacher judgments.
The criterion of school achievement, grade-point average . . . was generally obtained by assigning weights of 4, 3, 2, 1, and 0 to grades of A, B, C, D, and F, respectively. Only courses in the so-called "solids," that is, mathematics, science, social studies, foreign language, and English, were used in computing grade-point averages.¹⁵ (M)
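This definition translates directly into computation. A minimal sketch (the grades are hypothetical):

# The measured operational definition of school achievement quoted above:
# weights 4, 3, 2, 1, 0 for grades A, B, C, D, F, averaged over "solid" courses.
WEIGHTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def grade_point_average(grades):
    return sum(WEIGHTS[g] for g in grades) / len(grades)

solids = ["A", "B", "B", "C", "A"]  # hypothetical grades in the five "solids"
print(grade_point_average(solids))  # 3.2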
"W. Warner and P. Lum. The Social Life of a Modern Commiinin. New Haven; Yale University Press.
1941, p. 82.
"O. Duncan. D. Featherman, and B. Duncan, Socioeconomic Backgroiinii and Achievement New York; .
mately 4 centimeters beyond the anal sphincter. Changes in pressure in the balloon were transduced into electric voltages which produced a record on a polygraph."²⁰ (M)
Task Involvement ". . . each child's behavior during a lesson was coded every 6 sec as being appropriately involved, or deviant. The task involvement score for a lesson was the percentage of 6-sec units in which the children were coded as 'appropriately involved.'"²¹ (M)
Reinforcement Reinforcement definitions come in a number of forms. Most of them involve, in one way or another, the principle of reward. But both positive and negative reinforcement may be used.
". . . statements of agreement or paraphrase."²² (E) Then the author gives specific experimental definitions of "reinforcement." For example,
In the second 10 minutes, every opinion statement S made was recorded by E and reinforced. For two groups, E agreed with every opinion statement by saying: "Yes, you're right," "That's so," or the like, or by nodding and smiling affirmation if he could not interrupt. (E)
. . . the model and the child were administered alternately 12 different sets of story items. . . . To each of the 12 items, the model consistently expressed judgmental responses in opposition to the child's moral orientation and the experimenter reinforced the model's . . . behavior with verbal approval responses such as "Very good," "That's fine," and "That's good." The child was similarly reinforced whenever he adopted the model's class of moral judgments in response to his own set of items.²³ (E) [This is called "social reinforcement."]
Attitudes Toward Blacks "Racial attitudes were determined . . . by the use of the Multifactor Racial Attitude Inventory (Woodmansee & Cook, 1967)."²⁴ (M) This is a common form of operational definition: specification of the instrument used to measure a variable or variables. It is good practice to refer to original sources, as was done here.
Self-Esteem ". . .is measured by a 10-item index. . .
."-^ (M) The authors then
describe the sources of the items, the kind of scale, and the scoring.
Achievement (Academic Performance) "Academic performance . . . is operationalized as the respondents' report in 1966 of their overall grades for the previous year (which was the ninth grade, 1965-1966)."²⁶ (M) Note that this is a self-report measure, which may be subject to considerable error.
Race "All students were administered the TAQ (an anxiety scale) by either a
. . .
white or a Negro experimenter."-^ (E) It was noted that the assignment of subjects was
done at random. This operational definition is unusual. Race is usually a measured varia-
ble.
Memory: Recall and Recognition ". . . two basic methods. One, recall, is to ask the subject to recite what he remembers of the items shown him, giving him a point for each item that matches one on the stimulus list. The other testing procedure, recognition, is to
^"N. Miller, "Learning of Visceral and Glandular Responses," Science. 163 (1969), 434-445 (438).
-'J. Kounin and P. Doyle. "Degree of Continuity of a Lesson's Signal System and the Task Involvement of
Children," Journal of Educational Psychology. 67 (1975), 159-164.
"W. Verplanck, "The Control of the Content of Conversation: Reinforcement of Statements of Opinion."
Journal of Abnormal and Social Psychology. 51 (19551, 668-676.
-'a. BanduraandF. McDonald, "Influence of Social Reinforcement and the Behavior of Models in Shaping
Children's Moral Judgments." Journal of Abnormal and Social Psychology. 67 (1963), 274-281.
""S. Jones and S. Cook, "The Influence of Attitude on Judgments of the Effectiveness of Alternative Social
Policies." Journal of Personalityand Social Psychology. 32 (1975), Idl-lli.
"J. Bachman and P. O'Malley. "Self-Esteem in Young Men: A Longitudinal Analysis of the Impact of
Educational and Occupational Attainment." Journal of Personality and Social Psychology 35 (1977), 365-380 .
(368).
^*/Wd., p. 369.
"S. Baratz. "Effect of Race of Experimenter. Instructions, and Comparison Population Upon Level of
Reported Anxiety in Negro Subjects," Journal of Personality and Social Psychology. 7 (1967). 194-196.
show the subject test items and ask him to decide whether or not they were part of the stimulus list."²⁸ (M)
Ingratiation "Half of the subject groups were given the following instructions which
were designed to create an accuracy set." (Then instructions to give accurate answers in
an interview were given.) "Subjects in the remaining groups were given instructions to
create an ingratiation set."'"^ (Appropriate instructions followed.) (E)
Discrimination Against Jews "This tendency was assessed through the extent of the manager's agreement with four . . . questionnaire items."³⁰ (The items then followed. One of them was: "Anyone who employs many people should be careful not to hire a large percentage of Jews." Subjects expressed agreement and disagreement on a six-point scale.) (M)
Values " 'Rank the ten goals in the order of their importance to you.' ( 1 ) financial
success; (2) being liked; (3) success in family being intellectually capable; (5) life; (4)
living by religious principles; (6) helping others; (7) being normal, well-adjusted; (8)
cooperating with others; (9) doing a thorough job; (10) occupational success."^' (M)
Democracy (Political Democracy) "The index [of political democracy] consists of three indicators of popular sovereignty and three of political liberties. The three measures of popular sovereignty are: (1) fairness of elections, (2) effective executive selection, and (3) legislative selection. The indicators of political liberties are: (4) freedom of the press, (5) freedom of group opposition, and (6) government sanctions."³² (M) The author gives operational details of the six indicators in an appendix (pp. 585-586). Note that he used the term "indicators." Indicators — usually called social indicators — are variables that measure (indicate) the main features of a society: traffic safety, mental disease, wealth (and its distribution), access to education, crime, home ownership, teachers' salaries, investments, and so on.³³
The benefits of operational thinking have been great. Indeed, operationism has been and is one of the most significant and important movements of our times. Extreme . . .
. . . I would say that operational thinking makes better scientists. The operationist is forced to remove the fuzz from his empirical concepts. . . .³⁴
-"D. Norman, Memon- and Aueiuion: An Intioduction to Human Information Processing, 2nd ed. New
York: Wiley. 1976. p. 97.
-''E. Jones, Ingraiiation: A Social P syclwlogical Analysis. New York: Appleton-Century-Crofts. 1964. p. 53.
'"R. Quinn. R. Kahn. J. Tabor, and L. Gordon. The Chosen Few: A Study of Discrimination in Executive
Selection. Ann Arbor: Institute for Social Research. Survey Research Center. University of Michigan, 1968,
pp. 7-8.
^'
T. Newcomb. The Acquaintance Process. New York: Holt, Rinehart and Winston. 1 96 1 , pp. 40 and 83.
^"K. Bollen, "Political Democracy and the Timing of Development." American Sociological Review, 44
(1979). 572-587
(580). This is a particularly good e.iample of the operational definition of a comple.\ concept.
Moreover, an e.xcellent description of the ingredients of democracy.
it is
"See R. Bauer, ed.. Social Indicators. Cambridge. Mass.: MIT Press. 1966; E. Sheldon and R. Parke.
"Social Indicators." Science, 188 (1975). 693-699.
^''B, Underwood. Psychological Research. New York: Appleton. 1957, p. 53.
Study Suggestions
1. Write operational definitions for five or six of the following constructs. When possible, write two such definitions: an experimental one and a measured one.
reinforcement punitiveness
achievement reading ability
underachievement needs
leadership interests
transfer of training delinquency
level of aspiration need for affiliation
organizational conflict conformity
political preference marital satisfaction
Some of these concepts or variables — for example, needs and transfer of training — may be difficult to define operationally. Why?
2. Can any of the variables in 1, above, be both independent and dependent variables? Which ones?
3. It is instructive and broadening for specialists to read outside their fields. This is particularly true for students of behavioral research. It is suggested that the student of a particular field read two or three research studies in one of the best journals of another field. If you are in psychology, read a sociology journal, say the American Sociological Review. If you are in education or sociology, read a psychology journal, say the Journal of Personality and Social Psychology or the Journal of Experimental Psychology. Students not in education can sample the Journal of Educational Psychology or the American Educational Research Journal. When you read, jot down the names of the variables and compare them to the variables in your own field. Are they primarily active or attribute variables? Note, for instance, that psychology's variables are more "active" than sociology's. What implications do the variables of a field have for its research?
PART TWO
Sets, relations,
and variance

Chapter 4
Sets
The concept of "set" is one of the most powerful and useful of mathematical ideas in
understanding methodological aspects of research. Sets and their elements are the primi-
tive materials with which mathematics works. Even if we are unaware of it, sets and set
theory are foundations of our descriptive, logical, and analytic thinking and operating.
They are the basis of virtually all else in this book. They are the foundations upon which
we erect the complexities of numerical, categorical, and statistical analysis, even though
we do not always make the set basis of our thinking and work explicit. For example, set
theory provides us with an unambiguous definition of relations. It helps us approach and
understand probability and sampling. It is first cousin to logic. And it helps us to under-
stand the highly important subject of categories and categorizing the objects of the world.
Moreover, set thinking can even help us to understand that difficult problem of human
communication: confusion caused by mixing levels of discourse.
Science works basically with group, class, or set concepts. When scientists discuss
individual events or objects, they do so by considering such objects as members of sets of
objects. But this is true of human discourse in general. We say "goose," but the word "goose" is meaningless without the concept of a goose-like group called "geese." When we talk about a child and his problems, we inevitably must talk of the groups, classes, or sets of objects to which he belongs: a seven-year-old (first set), second-grade (second set), bright (third set), and healthy (fourth set) boy (fifth set).
A set is a well-defined collection of objects or elements.¹ A set is well defined when it
¹J. Kemeny, J. Snell, and G. Thompson, Introduction to Finite Mathematics, 2nd ed. Englewood Cliffs, N.J.: Prentice-Hall, 1966, p. 58.
is possible to tell whether a given object does or does not belong to the set. Terms like class, school, family, flock, and group indicate sets. There are two ways to define a set: (1) by listing all the members of the set, and (2) by giving a rule for determining whether objects do or do not belong to the set. Call (1) a "list" definition and (2) a "rule" definition. In research the rule definition is usually used, although there are cases where all members of a set are actually or imaginatively listed. For example, suppose we study the relation between voting behavior and political preference. Political preference can be defined as being a registered Republican or Democrat. We then have a large set of all people with political preferences with two smaller subsets: the subset of Republicans and the subset of Democrats. This is a rule definition of sets. Of course, we might list all registered Democrats and all registered Republicans to define our two subsets, but this is often difficult if not impossible. Besides, it is unnecessary; the rule is usually sufficient. Such a rule might be: A Republican is any person who is registered in the Republican party. Another such rule might be: A Republican is any person who says he is a Republican.
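The difference between the two kinds of definition is easy to see in code. Here is a minimal sketch in Python; the names and party registrations are hypothetical, invented only for illustration. The same set of Republicans is defined once by list and once by rule:

    # "List" definition: every member is enumerated.
    republicans_by_list = {"Adams", "Baker", "Cole"}

    # "Rule" definition: a predicate decides membership.
    # (Hypothetical registration data, for illustration only.)
    registrations = {"Adams": "R", "Baker": "R", "Cole": "R", "Diaz": "D"}

    def is_republican(person):
        # Rule: a Republican is any person registered in the Republican party.
        return registrations.get(person) == "R"

    republicans_by_rule = {p for p in registrations if is_republican(p)}
    assert republicans_by_rule == republicans_by_list

The rule generates the same subset without anyone having to list its members in advance, which is one reason the rule definition is usually preferred in research.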
SUBSETS
A subset of a set is a set that results from selecting sets from an original set. Each subset of a set is part of the original set. More succinctly and accurately, "A set S is a subset of a set A whenever all the elements of S are elements of A."² We designate sets by capital letters: A, B, K, L, X, Y, and so forth. If B is a subset of A, we write B ⊂ A, which means "B is a subset of A," "B is contained in A," or "All members of B are also members of A."
Whenever a population is sampled, the samples are subsets of the population. Suppose an investigator samples four eleventh-grade classes out of all the eleventh-grade classes in a large high school. The four classes form a subset of the population of all the eleventh-grade classes. Each of the four classes of the sample, too, can be considered a subset of the four classes — and also of the total population of classes. All the children of the four classes can be broken down into two subsets of boys and girls. Whenever a researcher breaks down or partitions a population or a sample into two or more groups, he is "creating" subsets, using a "rule" or criterion to do so. Examples are numerous: religious preferences into Protestant, Catholic, Jew; intelligence into high and low; and so on. Even experimental conditions can be so viewed. The classic experimental-control group idea is a set-subset idea. Individuals are put into the experimental group; this is a subset of the whole sample. All other individuals used in the experiment (the control-group individuals) form a subset, too.
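The partitioning can be sketched the same way. The following Python fragment, with hypothetical subject labels, partitions a sample into experimental and control subsets and checks that the partition is exhaustive and disjoint:

    # A hypothetical sample of ten subjects.
    sample = {"S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10"}
    experimental = {"S1", "S3", "S5", "S7", "S9"}   # one subset of the sample
    control = sample - experimental                 # the remaining subjects

    assert experimental <= sample and control <= sample   # both are subsets
    assert experimental | control == sample               # exhaustive
    assert experimental & control == set()                # disjoint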
SET OPERATIONS
There are two basic set operations: intersection and union. An operation is simply "a doing-something-to." In arithmetic we add, subtract, multiply, and divide. We "intersect" and "union" sets. We also "negate" them.
Intersection is the overlapping of two or more sets; it is the elements shared in common by the two or more sets. The symbol for intersection is ∩ (read "intersection" or "cap"). The intersection of the sets A and B is written A ∩ B, and A ∩ B is, itself, a set.
²R. Kershner and L. Wilcox, The Anatomy of Mathematics. New York: Ronald, 1950, p. 35.
More precisely, it is the set that contains those elements of A and B that belong to both A and B. Intersection is also written A · B, or simply AB.
Let A = {0, 1, 2, 3}; let B = {2, 3, 4, 5}. (Note that we use braces, "{ }," to symbolize sets.) Then A ∩ B = {2, 3}. This is shown in Figure 4.1. A ∩ B, or {2, 3}, is a new set composed of the members common to both sets. Note that A ∩ B also indicates the relation between the sets, the elements shared in common by A and B.
The union of two sets is written A ∪ B. A ∪ B is a set that contains all the members of A and all the members of B. Mathematicians define A ∪ B as a set that contains those elements that belong either to A or to B or to both. In other words, we "add" the elements of A to those of B to form a new set A ∪ B. Take the example of Figure 4.1. A included 0, 1, 2, and 3; B included 2, 3, 4, and 5. A ∪ B = {0, 1, 2, 3, 4, 5}. The union of A and B in Figure 4.1 is indicated by the whole area of the two circles. Note that we do not count the members common to both sets twice.
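Python's set notation lets the reader verify the example directly; & is intersection and | is union:

    A = {0, 1, 2, 3}
    B = {2, 3, 4, 5}

    print(A & B)   # intersection: {2, 3}
    print(A | B)   # union: {0, 1, 2, 3, 4, 5}; shared members appear once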
The universal set, U, is the set of all elements under discussion; it can be small or large. If we sample the schools of a large county, then U is all the schools in the county, a rather large U. U might also be all the children or all the teachers in these schools, still larger U's.
In research it is important to know the U one is studying. Ambiguity in the definition of U can lead to erroneous conclusions. It is known, for example, that social classes differ . . .
Figure 4.1
Although the empty set, E, has no members, the notion is quite useful, even indispensable. With it we can convey certain ideas economically and unambiguously. To indicate that there is no relation between two sets of data, for example, we can write the set equation A ∩ B = E, which simply says that the intersection of the sets A and B is empty, meaning that no member of A is a member of B, and vice versa.
Let A = {1, 2, 3}; let B = {4, 5, 6}. Then A ∩ B = E. Clearly there are no members common to A and B. The set of possibilities of the Democratic and Republican presidential candidates both winning the national election is empty. The set of occurrences of rain without clouds is empty. The empty set, then, is another way of expressing the falsity of propositions. In this case we can say that the statement "Rain without clouds" is false. In set language this can be expressed P ∩ ~Q = E, where P = the set of all occurrences of rain, Q = the set of all occurrences of clouds, and ~Q = the set of all occurrences of no clouds.
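The same statement can be checked mechanically. In this small Python sketch (the occurrences are hypothetical), P ∩ ~Q turns out empty, which is the set-language way of saying the proposition "Rain without clouds" is false:

    # Each hypothetical occurrence is tagged with the conditions that held.
    occurrences = [{"rain", "clouds"}, {"clouds"}, {"rain", "clouds"}, set()]

    U = set(range(len(occurrences)))                    # all occurrences
    P = {i for i in U if "rain" in occurrences[i]}      # occurrences of rain
    Q = {i for i in U if "clouds" in occurrences[i]}    # occurrences of clouds
    not_Q = U - Q                                       # occurrences of no clouds

    assert P & not_Q == set()   # P ∩ ~Q = E: no rain without clouds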
The negation or complement of the set A is written ~A. It means all members of U not in A. If we let A = all men, when U = all human beings, then ~A = all women (not-men). Simple dichotomization seems to be a fundamental basis of human thinking. In order to think, categorization is necessary: one must, at the most elementary level, separate objects into those belonging to a certain set and those not belonging to the set. We must distinguish men and not-men, me and not-me, early and not-early, good and not-good.
If U = {0, 1, 2, 3, 4} and A = {0, 1}, then ~A = {2, 3, 4}. A and ~A are of course subsets of U. An important property of sets and their negation is expressed in the set equations A ∪ ~A = U and A ∩ ~A = E.
SET DIAGRAMS
We now pull together and illustrate the basic set ideas already presented by diagramming them. Sets can be depicted with various kinds of figures, but rectangles and circles are
Figure 4.2
³E. Hilgard, R. C. Atkinson, and R. L. Atkinson, Introduction to Psychology, 6th ed. New York: Harcourt Brace Jovanovich, 1975, p. 490.
Figure 4.3 (A ∪ B)
ordinarily used. They have been adapted from a system invented by John Venn. In this book rectangles, circles, and ovals will be used. Look at Figure 4.2. U is represented by the rectangle. All members of the universe under discussion are in U. All members of U not in A form another subset of U: ~A. Note, again, that A ∪ ~A = U. Note, too, that A ∩ ~A = E; that is, there are no members common to both A and ~A.
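Both identities can be verified with the small universe used earlier:

    U = {0, 1, 2, 3, 4}
    A = {0, 1}
    not_A = U - A                  # the complement of A within U: {2, 3, 4}

    assert A | not_A == U          # A ∪ ~A = U
    assert A & not_A == set()      # A ∩ ~A = E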
Next, we depict, in Figure 4.3, two sets, A and B, both subsets of U. From the diagram it can be seen that A ∩ B = E. We adopt a convention: when we wish to indicate a set or a subset, we shade it either horizontally, vertically, or diagonally. The set A ∪ B has been shaded in Figure 4.3.
Intersection, probably the most important set notion from the point of view of this book, is indicated by the shaded portion of Figure 4.4. The situation can be expressed by the equation A ∩ B ≠ E; the intersection of the sets A and B is not empty.
When two sets, A and B, are equal, they have the same set elements or members. The Venn diagram would show two congruent circles in U. In effect, only one circle would show. When A = B, then A ∩ B = A ∪ B = A = B.
We diagram A ⊂ B, A is a subset of B, in Figure 4.5. B has been shaded horizontally, A vertically. Note that A ∪ B = B (whole shaded area) and A ∩ B = A (area shaded both horizontally and vertically). All members of A are also in B, or all a's are also b's, if we let a = any member of A and b = any member of B.
Figure 4.4 (A ∩ B)
Figure 4.5 (A ⊂ B)
The triply hatched area shows A ∩ B ∩ C. There are four intersections, each hatched differently: A ∩ B, A ∩ C, B ∩ C, and A ∩ B ∩ C.
Although four or more sets can be diagrammed, such diagrams become cumbersome and not easy to draw and inspect. There is no reason, however, why the intersection and union operations cannot be applied symbolically to four or more sets.

A₁ ∪ A₂ = U and A₁ ∩ A₂ = E
B₁ ∪ B₂ = U and B₁ ∩ B₂ = E

Diagrams make this clearer. The partitioning of U, represented by a rectangle, separately into the subsets A₁ and A₂ and into B₁ and B₂, is shown in Figure 4.7. Both . . .
Figure 4.6 (A ∩ B ∩ C)
Table 4.1 Crossbreak Table: Relation Between Social Class and Weaning,
Miller and Swanson Study
In research, we must be very careful not to mix or shift our levels of discourse, or to do so only knowingly and consciously. Set thinking helps us avoid such mixing and shifting. As an extreme example, suppose an investigator decided to study the toilet training, authoritarianism, musical aptitude, creativity, intelligence, reading achievement, and general scholastic achievement of ninth-grade youngsters. While it is conceivable that some sort of relations can be teased out of this array of variables, it is more conceivable that it is an intellectual mess. At any rate, remember sets. Ask yourself: "Do the objects I am discussing or am about to discuss belong to the set or sets of my present discussion?" If so, then you are on one level of discourse. If not, then another level of discourse, another set, or set of sets, is entering the discussion. If this happens without your knowing it, the result is confusion. In short, ask: "What are U and the subsets of U?"
Research requires precise definitions of universal sets. "Precise" means: give a clear rule that tells you when an object is or is not a member of U. Similarly, define subsets of U and the subsets of the subsets of U. If the objects of U are people, then you cannot have a subset with objects that are not people. (Though you might have a set A of people and the set ~A of "not-people," this logically amounts to U being people. "Not-people" is in this case a subset of "people," by definition or convention.)
The set idea is fundamental in human thinking. This is because all or most thinking probably depends on putting things into categories and labeling the categories, as indicated earlier.⁷ What we do is to group together classes of objects — things, people, events, phenomena in general — and name these classes. Such names are then concepts, labels that we no longer need to learn anew and that we can use for efficient thinking.
Set theory is also a general and widely applicable tool of conceptual and analytic thinking. Its most important applications pertinent to research methodology are probably to the study of relations, logic, sampling, probability, measurement, and data analysis, as indicated earlier. But sets can be applied to other areas and problems that are not considered technical in the sense, say, that probability and measurement are. Piaget, for example, has used set algebra to help explain the thinking of children.⁸ Hunt has applied sets to his study of concept learning.⁹ Coombs presented his important theory of data largely in set terms.¹⁰ Warr and Smith, with remarkable ingenuity and insight, used set theory to test different models of inference about personal traits — with rather surprising results.¹¹ Later in this book, measurement will be defined using a single set-theoretic equation. In addition, basic principles of sampling, analysis, and of research design will be clarified with sets and set theory. Unfortunately, most social scientists and educators are still not aware of the generality, power, and flexibility of set thinking. It can be safely predicted, however, that researchers in the social sciences and education will find set thinking and theory increasingly useful in the conceptualization of theoretical and research problems.
Study Suggestions
1. Draw two overlapping circles, enclosed in a rectangle. Label the following parts: the universal set U, the subsets A and B, the intersection of A and B, and the union of A and B.
⁷J. Bruner, J. Goodnow, and G. Austin, A Study of Thinking. New York: Wiley, 1956, chap. 1; E. Rosch, "Principles of Categorization." In E. Rosch and B. Lloyd, eds., Cognition and Categorization. Hillsdale, N.J.: Erlbaum Associates, 1978, chap. 2.
⁸J. Piaget, Logic and Psychology. New York: Basic Books, 1957. Also, B. Inhelder and J. Piaget, The Growth of Logical Thinking from Childhood to Adolescence. New York: Basic Books, 1958.
⁹E. Hunt, Concept Learning. New York: Wiley, 1962.
¹⁰C. Coombs, "A Theory of Data," Psychological Review, 67 (1960), 143-159.
¹¹P. Warr and J. Smith, "Combining Information about People: Comparisons Between Six Models," Jour-
(a) If you were working on a research problem involving fifth-grade children, what part of the diagram would indicate the children from which you might draw samples?
(b) What might the sets A and B represent?
(c) What meaning might the intersection of A and B have?
(d) How would you have to change the diagram to represent the empty set? Under what conditions would such a diagram have research meaning?
What is the meaning of the following sets — that is, what would we call any object in the sets?
3. Make a cross partition using the variables socioeconomic status and voting preference (Democrat and Republican). Can a sample of American individuals be unambiguously assigned to the cells of the cross partition? Are the cells exhaustive? Are they disjoint? Why are these two conditions necessary?
4. Under what conditions will the following set equation be true?
Chapter 5
Relations
Relations are the essence of knowledge. What is important in science is not knowledge
of particulars but knowledge of the relations among phenomena. We know that large
things are large only by comparing them to smaller things. We thus establish the relations
"greater than" and "less than." Educational scientists can "know" about achievement
only as they study achievement in relation to nonachievement and in relation to other
variables. When they learn that children of higher intelligence generally do well in school
and that children of lower intelligence often do less well, they "know" an important facet
of achievement. When they also learn that middle-class children tend to do better in
school than working-class children, they are beginning to understand "achievement."
They are learning about the relations that give meaning to the concept of achievement.
The relations between intelligence and achievement, between social class and achieve-
ment, and, indeed, between any variables are the basic "stuff" of science.
The relational nature of human knowledge is clearly seen even when seemingly obvious "facts" are analyzed. Is a stone hard? To say whether this statement is true or false we must examine sets and subsets of different kinds of stones. Then, after operationally defining "hard," we compare the "hardness" of stones to other "hardnesses." The "simplest" facts turn out, on analysis, to be not so simple. Northrop, discussing concepts and facts, says, "The only way to get pure facts, independent of all concepts and theory, is merely to look at them and forthwith to remain perpetually dumb. . . ."¹
The dictionary tells us that a relation is a bond, a connection, a kinship. For most
people this definition is good enough. But what do "bond," "connection," and "kin-
¹F. Northrop, The Logic of the Sciences and the Humanities. New York: Macmillan, 1947, p. 317. See also M. Cohen and E. Nagel, An Introduction to Logic and Scientific Method. New York: Harcourt, 1934, pp. 217-219.
ship" mean? Again, tiie dictionary says that a bond is a tie, a binding force, and that a
connection among other things, a union, a relationship, an alliance. But a union, a tie,
is,
between what? And what do "union," "tie." and "binding force" mean? Such defini-
tions, while intuitively helpful, are too ambiguous for scientific use.
Intelligence Achievement
136 55
125 57
118 42
110 48
100 42
97 35
90 32
Consider the two sets as one set of pairs. Then this set is a relation.
If we graph the two sets of scores on X and Y axes, as we did in Chapter 3 (Figure 3.3), the relation becomes easier to "see." This has been done in Figure 5.1. Each point is defined by two scores. For example, the point farthest to the right is defined by (136, 55), and the point farthest to the left is (90, 32). Graphs like Figure 5.1 are highly useful and succinct ways to express relations. One sees at a glance, for instance, that higher values of X are accompanied by higher values of Y, and lower values of X by lower values of Y. As we will see in a later chapter, it is also possible to draw a line through the plotted points of Figure 5.1, from lower left to upper right. (The reader should try this.) This line, called a regression line, also expresses the relation between X and Y, between intelligence and achievement, but it also succinctly gives us considerably more information about the relation: namely, its direction and magnitude.
We are now ready to define "relation" formally: A relation is a set of ordered pairs. Any relation is a set, a certain kind of set: a set of ordered pairs. An ordered pair is two objects, or a set of two elements, in which there is a fixed order for the objects to appear. Actually, we speak of sets of ordered pairs, which means, as indicated earlier, that the members . . .
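Written in code, the fictitious intelligence and achievement scores given earlier are literally a set of ordered pairs:

    # The relation of Figure 5.1: ordered (intelligence, achievement) pairs.
    R = {(136, 55), (125, 57), (118, 42), (110, 48),
         (100, 42), (97, 35), (90, 32)}

    first_elements = {x for x, y in R}    # the intelligence scores
    second_elements = {y for x, y in R}   # the achievement scores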
Figure 5.2
Whatever sets of ordered pairs we pick, each is a relation. It is up to us to decide whether or not the sets we pick make scientific sense according to the dictates of the problems to which we are seeking answers.
The reader may wonder why so much trouble has been taken to define relations. The basic answer is simple: almost all science pursues and studies relations. There is literally no empirical way to "know" anything except through its relations to other things, as indicated earlier. If, like Suedfeld and Rank, we are interested in the success of revolutionary leaders (for example, Jefferson, Mao Tse-tung, and Castro), we have to relate that success or lack of success to other variables.² To explain a phenomenon like revolutionary success, we must "discover" its determinants, the relations it has with other pertinent variables. Suedfeld and Rank "explained" revolutionary success by relating such success to the conceptual complexity of revolutionary leaders. For revolutionary success before a revolution, conceptual simplicity is needed, but after a revolution conceptual complexity is needed.
Any objects — people, numbers, gambling outcomes, points in space, symbols, and so on and on — can be members of sets, and sets can be related in the ordered-pair sense. It is said that the members of one set are mapped on to the members of another set by means of a rule of correspondence. A rule of correspondence is a prescription or a formula that tells us how to map the objects of one set on to the objects of another set. It tells us, in brief, how the correspondence between set members is achieved. Study Figure 5.3, which shows the relation between the names of five individuals and the symbols 1 and 0, which stand for male (1) and female (0). We have here a mapping of sex (1 and 0) on to the names. This is of course a relation, each name having either 1 or 0, male or female, assigned to it.
In a relation the two sets whose "objects" are being related are called the domain and the range, or D and R. D is the set of first elements, and R the set of second elements. In Figure 5.3, we assigned 1 to male and 0 to female. To each member of the domain the . . .
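A sketch of such a mapping in Python, with hypothetical names standing in for Figure 5.3's five individuals, makes the domain and range concrete:

    # The rule of correspondence assigns 1 (male) or 0 (female) to each name.
    sex = {"John": 1, "Mary": 0, "Tom": 1, "Ann": 0, "Bob": 1}

    D = set(sex)                   # domain: the set of first elements (names)
    R = set(sex.values())          # range: the set of second elements, {0, 1}
    relation = set(sex.items())    # the relation itself: a set of ordered pairs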
Graphs
A graph is a drawing in which the two members of each ordered pair of a relation are plotted on two axes, X and Y (or any appropriate designation). Figure 5.1 is a graph of the ordered pairs of the fictitious intelligence and achievement scores given earlier. We can see from that graph that the ordered pairs tend to "go together": high values of Y go with high values of X, and low values of Y go with low values of X.
A more interesting set of ordered pairs is graphed in Figure 5.4. The numbers used to make the graph are from a fascinating study by Miller and DiCara, in which seven rats . . .
Figure 5.3 (Domain and Range)
Figure 5.4
Of the 31 subjects induced to lie, 20 complied with the demands of the experimenter; 11 did not comply with the demands. Of the 31 subjects who were not induced to lie, 11 complied and 20 did not comply. The data are consistent with the hypothesis. In a later chapter we will study in detail how to analyze and interpret frequency data and tables of this kind.
Table 5.1
            Induced to Lie    Not Induced to Lie
Comply            20                  11
Not-Comply        11                  20
                  31                  31
The point of Table 5.1 is that a relation and the evidence on the nature of the relation are expressed in the table. In this case the tabled data are in frequency form. (A frequency is the number of members of sets and subsets.) The table itself is a cross partition, often called a crossbreak, in which one variable of the relation is set up against another variable of the relation. The two variable labels appear on the top and side of the table, as indicated earlier. The direction and magnitude of the relation itself are expressed by the relative sizes of the frequencies in the cells of the table. In Table 5.1 many more of the subjects induced to lie (20 of 31) complied than did the subjects not induced to lie (11 of 31).
A different kind of table presents means, arithmetic averages, in the body of the table. The means express the dependent variable. If there is only one independent variable, its categories are labeled at the top of the table. If there are two or more independent variables, their categories can be presented in various ways at the top and sides of tables, as we will see in later chapters. An example is given in Table 5.2, which is the simplest form such a table can take. In this study by Clark and Walberg, the effect of massive reinforcement on the reading achievement of underachieving black children was studied.⁵ It was expected that the reinforcement given to the experimental group children would increase their reading scores compared to those of a similar group of children who were not reinforced ("Control" in the table). As can be seen, the experimental group mean is larger than the control group mean. Is the difference between the means "large" or "small"? We will later learn how to assess the size and meaning of such differences. Here we are interested only in why the table expresses a relation.
Table 5.2
Experimental    Control
   31.62         26.86
In tables of this kind a relation is always expressed or implied.⁶ In the present case there are two variables being related: reinforcement and reading achievement. The rubric "Experimental" stands for the massive reinforcement that the experimental group re-
""C. Clark and H. Walberg. "The Influence of Massive Rewards on Reading Achievement in Potential
Urban School Dropouts." American Educational Research Journal. 5 (1968). 305-310.
'"Tables as simple as this one are rarely given in the literature. It saves space merely to mention the two
means in the text of a report. Moreover, there can be more than two means compared. The principle is the same,
however: the means "express" the dependent variable, and the differences among them the presumed effect of
the independent variable.
ceived and the control group did not receive. This is the independent variable. The two means in the table express the reading achievement of the two groups of children, the dependent variable. If the means differ sufficiently, then it can be assumed that reinforcement had an effect on reading achievement.
Tables of means are extremely important in behavioral research, especially in experimental research. There can be two, three, or more independent variables, and tables of means can express the separate and combined effects of these variables on a dependent variable, or even on two or more dependent variables. The central point is that relations are always studied, even though it is not always easy to conceptualize and to state the relations.
Although we briefly examined relations and graphs earlier, it will be profitable to pursue this important topic further. Suppose we have two sets of scores of the same individuals on two tests, X and Y. The two sets form a set of ordered pairs. This set is of course a relation. It can also be written, letting R stand for relation, R = {(1,1), (2,1), (2,2), (3,3)}. It is plotted in the graph of Figure 5.5.
We can often get a rough idea of the direction and degree of a relation by inspection of lists of ordered pairs, but such a method is imprecise. Graphs, such as those of Figures 5.1 and 5.5, tell us more. It can more easily be "seen" that Y values "go along" with X values: higher values of Y accompany higher values of X, and lower values of Y accompany lower values of X. In this case the relation, or correlation, as it is also called, is positive. (If we had the equation, R = {(1,3), (2,1), (2,2), (3,1)}, the relation would be negative. The student should plot these values and note their direction and meaning.) If the equation were R = {(1,2), (2,1), (2,2), (3,2)}, the relation would be null or zero. This is plotted in Figure 5.6. It can be seen that Y values do not "go along" with X values in any systematic way. This does not mean that there is "no" relation. There is always a relation — by definition — since there is a set of ordered pairs. It is commonly said, however, that there is "no" relation. It is more accurate to say that the relation is null or zero.
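The three sets of ordered pairs just discussed can be checked with a small function for the coefficient of correlation; the formula used here, covariance divided by the square root of the product of the variances, is developed in the next chapter:

    def r(pairs):
        # Coefficient of correlation for a list of (x, y) ordered pairs.
        n = len(pairs)
        mx = sum(x for x, _ in pairs) / n
        my = sum(y for _, y in pairs) / n
        cov = sum((x - mx) * (y - my) for x, y in pairs) / n
        vx = sum((x - mx) ** 2 for x, _ in pairs) / n
        vy = sum((y - my) ** 2 for _, y in pairs) / n
        return cov / (vx * vy) ** 0.5

    print(r([(1, 1), (2, 1), (2, 2), (3, 3)]))   # positive
    print(r([(1, 3), (2, 1), (2, 2), (3, 1)]))   # negative
    print(r([(1, 2), (2, 1), (2, 2), (3, 2)]))   # zero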
Figure 5.6
negative relation, as the case may be. If they do not covary, it is said there is "no" relation. The most useful such indices range from +1.00 through 0 to -1.00, +1.00 indicating a perfect positive relation, -1.00 a perfect negative relation, and 0 no discernible relation, or zero relation. Some indices range only from 0 to +1.00. Other indices may take on other values.
Most coefficients of relation tell us how similar the rank orders of two sets of measures are. Table 5.3 presents three examples to illustrate this going together of rank orders. The coefficients of correlation are given with each of the sets of ordered pairs. I is obvious: the rank orders of the X and Y scores of I go together perfectly. So do the X and Y scores of II, but in the opposite direction. In III, no relation between the rank orders can be discerned. In I and II, one can predict perfectly from X to Y, but in III one cannot predict . . .
Table 5.3 Three Sets of Ordered Pairs Showing Different Directions and
Degrees of Correlation
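A small sketch with hypothetical ranks illustrates the three cases. The standard rank-order coefficient is rho = 1 - 6Σd²/(n(n² - 1)), where d is the difference between paired ranks:

    def rank_order_r(ranks_x, ranks_y):
        # Spearman-type rank-order coefficient for two sets of ranks.
        n = len(ranks_x)
        d2 = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
        return 1 - 6 * d2 / (n * (n ** 2 - 1))

    print(rank_order_r([1, 2, 3, 4], [1, 2, 3, 4]))   # I: perfect positive, 1.0
    print(rank_order_r([1, 2, 3, 4], [4, 3, 2, 1]))   # II: perfect negative, -1.0
    print(rank_order_r([1, 2, 3, 4], [2, 4, 1, 3]))   # III: zero, 0.0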
To put some flesh on the rather abstract bones of our discussion of relations, let's look at two interesting examples of relations and correlation. In studying emotional aspects of prejudice, Cooper calculated two sets of ranks. He had his subjects respond to nine national and ethnic groups by choosing from each of all the possible pairs of such groups . . . supported by high levels of emotion. Examination of the two sets of rank orders — the set of ordered pairs — shows that they tend to covary, thus supporting the hypothesis. The actual coefficient of correlation — called a rank-order coefficient of correlation — was .82, a high value. Evidently different groups evoke different emotional responses, and groups perceived negatively evoke the stronger responses, as Cooper predicted.
Table 5.5 The Relation Between Religious Affiliation and Output of Scholarly Doctorates in the United States, Hardy Study
. . . simultaneously, and the dependent variable, and second, of determining how much each independent variable, x₁, x₂, and x₃, influences the dependent variable, y. Though now much more complex, the problem is still a relation, a set of ordered pairs.
The student's natural bewilderment with the presumed mysteries of multivariate thinking should be dissipated and replaced by admiration and perhaps a bit of awe and excitement at these . . .
Study Suggestions
1. Kershner, R., and Wilcox, L. The Anatomy of Mathematics. New York: Ronald, 1950, pp. 41-60.
2. Six examples of relations are given below. Assume that the first-named set is the domain and
the second the range. Why are all of these relations?
3. An educational investigator has studied the relation between anxiety and school achieve-
ment. Express the relation in set language.
4. Suppose that you wish to study the relations among the following variables: intelligence,
socioeconomic status, need for achievement, and school achievement. Set up two alternative mod-
els that "explain" school achievement. Draw path diagrams of the two models.
Chapter 6
Variance and Covariance
To study scientific problems and to answer scientific questions, we must study differ-
ences among phenomena. In Chapter 5, we examined relations among variables; in a
sense, we were studying similarities. Now we concentrate on differences because without
differences, without variation, there is no technical way to determine the relations among
variables. If we want to study the relation between race and achievement, for instance, we
are helpless if we have only achievement measures of white children. We must have
achievement measures of children of more than one race. In short, race must vary; it must
have variance. It is necessary to explore the variance notion analytically and in some
depth. To do so adequately, it is also necessary to skim some of the cream off the milk of
statistics.
Studying sets of numbers as they are is unwieldy. It is usually necessary to reduce the sets in two ways: by calculating averages or measures of central tendency, and by calculating measures of variability. The measure of central tendency used in this book is the mean. The measure of variability most used is the variance. Both kinds of measures epitomize sets of scores, but in different ways. They are both "summaries" of whole sets of scores, "summaries" that express two important facets of the sets of scores: their central or average tendency and their variability. Solving research problems without these measures is next to impossible. We start our study of variance, then, with some simple computations.
M = ΣX / n     (6.1)

n = the number of cases in the set of scores; Σ means "the sum of" or "add them up." X stands for any one of the scores; that is, each score is an X. The formula, then, says, "Add the scores and divide by the number of cases in the set." Thus:

M = (1 + 2 + 3 + 4 + 5) / 5 = 15 / 5 = 3
Calculating the variance, while not as simple as calculating the mean, is still simple. The formula is:¹

V = Σx² / n     (6.2)

V means variance; n and Σ are the same as in Equation 6.1. Σx² is called the sum of squares; it needs some explanation. The scores are listed in a column:
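In code, the whole computation of Equations 6.1 and 6.2 for the set {1, 2, 3, 4, 5} takes a few lines:

    scores = [1, 2, 3, 4, 5]
    n = len(scores)

    M = sum(scores) / n                      # mean: 15 / 5 = 3
    x = [X - M for X in scores]              # deviation scores: X - M
    sum_of_squares = sum(d * d for d in x)   # sum of squares, Σx² = 10
    V = sum_of_squares / n                   # variance: 10 / 5 = 2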
It is called this because, obviously, it is the mean of the x²'s. Clearly it is not difficult to calculate the mean and the variance.
The question is: Why calculate the mean and the variance? The rationale for calculating the mean is easily disposed of. The mean expresses the general level, the center of gravity, of a set of measures. It is, in general, a good representative of the level of a group's characteristics or performance. It also has certain desirable statistical properties, and it is the most ubiquitous statistic of the behavioral sciences. In much behavioral research, for example, means of different experimental groups are compared to study relations, as pointed out in Chapter 5. We may be testing the relation between organizational climates and productivity, for instance. We may have used three kinds of climates and may be interested in the question of which climate has the greatest effect on productivity. In such cases means are customarily compared. For instance, of three groups, each operating under one of three climates, A₁, A₂, and A₃, which has the greatest mean on, say, a measure of productivity?
The rationale for computing and using the variance in research is more difficult to explain. In the usual case of ordinary scores the variance is a measure of the dispersion of the set of scores. It tells us how much the scores are spread out. If a group of pupils is very heterogeneous in reading achievement, then the variance of their reading scores will be large compared to the variance of a group that is homogeneous in reading achievement. The variance, then, is a measure of the spread of the scores; it describes the extent to which the scores differ from each other.² The remainder of this chapter and later parts of the book will explore other aspects of the use of the variance statistic.
KINDS OF VARIANCE
Variances come in a number of forms. When you read the research and technical literature, you will frequently come across the term, sometimes with a qualifying adjective, sometimes not. To understand the literature, it is necessary to have a good idea of the characteristics and purposes of these different variances. And to design and do research, one must have a rather thorough understanding of the variance concept, as well as considerable mastery of statistical variance notions and manipulations.
¹It is suggested that the student supplement his study with study of appropriate sections of an elementary statistics text, since it will not be possible in this book to discuss all the facets of meaning and interpretation of means, variances, and standard deviations.
were also a complete list of intelligence test scores of these people, the variance could be — simply if wearily — computed. No such list exists. So samples — representative samples — of Americans are tested and means and variances computed. The samples are used to estimate the mean and variance of the whole population.
Sampling variance is the variance of statistics computed from samples. The means of four random samples drawn from a population will differ. If the sampling is random and the samples are large enough, the means should not vary too much. That is, the variance of the means should be relatively small.³
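A small simulation makes the point concrete. The sketch below (Python; the population is artificial, invented for illustration) draws repeated random samples and shows that the variance of the sample means shrinks as the samples grow:

    import random

    # An artificial "population" with mean near 100 and SD near 15.
    population = [random.gauss(100, 15) for _ in range(100000)]

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    for n in (10, 100, 1000):
        means = [sum(random.sample(population, n)) / n for _ in range(500)]
        print(n, round(variance(means), 2))   # shrinks roughly as V / n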
Systematic Variance
Perhaps the most general way to classify variances is as systematic variance and error variance. Systematic variance is the variation in measures due to some known or unknown influences that "cause" the scores to lean in one direction more than another. Any natural or man-made influences that cause events to happen in a certain predictable way are systematic influences. The achievement test scores of the children in a wealthy suburban school will tend to be systematically higher than the scores of the children in a city slum area school. Expert teaching may systematically influence the achievement of children — as compared to the achievement of children taught inexpertly.
There are many, many causes of systematic variance. Scientists seek to separate those in which they are interested from those in which they are not interested. They also try to separate from systematic variance variance that is random. Indeed, research may narrowly and technically be defined as the controlled study of variances.
³Unfortunately, in much actual research only one sample is usually available — and this one sample is frequently small. We can, however, estimate the sampling variance of the means by using what is called the standard variance of the mean(s). (The term "standard error of the mean" is usually used. The standard error of the mean is the square root of the standard variance of the mean.) The formula is

V_M = V / n

where V_M is the standard variance of the mean, V the variance of the sample, and n the size of the sample. Notice an important conclusion that can be reached from this equation. If the size of the sample is increased, V_M is decreased. In other words, to be more confident that the sample mean is close to the population mean, make n large. Conversely, the smaller the sample, the riskier the estimate. (See Study Suggestions 5 and 6 at the end of the chapter.)
⁴This is obtained by squaring the standard deviation reported in a test manual. The standard deviation of the California Test of Mental Maturity for 11-year-old children, for instance, is about 15, and 15² = 225.
between-groups variance. They succeeded: the new "experimental group" animals ex-
ceeded the new "control group" animals in choosing the larger circle in the same choice
task!
This is clear and easy to see in experiments. In research that is not experimental, in research where already existing differences between groups are studied, it is not always so clear and easy to see that one is studying between-groups variance. But the idea is the same. The principle may be stated in a somewhat different way: The greater the differences between groups, the more an independent variable or variables can be presumed to have operated. If there is little difference between groups, on the other hand, then the presumption must be that an independent variable or variables have not operated, that their effects are too weak to be noticed, or that different influences have canceled each other out. We judge the effects of independent variables that have been manipulated, or that have worked in the past, by between-groups variance. Whether the independent variables have or have not been manipulated, the principle is the same.
To illustrate the principle, we use the well-studied problem of the effect of anxiety on school achievement. It is possible to manipulate anxiety by having two experimental groups and inducing anxiety in one and not in the other. This can be done by giving each group the same test with different instructions. We tell the members of one group that their grades depend wholly on the test. We tell the members of the other group that the test does not matter particularly, that its outcome will not affect grades. On the other hand, the relation between anxiety and achievement may also be studied by comparing groups of individuals on whom it can be assumed that different environmental and psychological circumstances have acted to produce anxiety. (Of course, the experimentally induced anxiety and the already existing anxiety — the stimulus variable and the organismic variable — are not assumed to be the same.) A study to test the hypothesis that different environmental and psychological circumstances act to produce different levels of anxiety was done by Sarnoff et al.⁵ The investigators predicted that, as a result of the English 11-plus examinations, English school children would exhibit greater test anxiety than would American school children. In the language of this chapter, the investigators hypothesized a between-groups variance larger than could be expected by chance because of the differences between English and American environmental, educational, and psychological conditions. (The hypothesis was supported.)
Error Variance
It is probably safe to say that the most ubiquitous kind of variance in research is error variance. Error variance is the fluctuation or varying of measures due to chance. Error variance is random variance. It is the variation in measures due to the usually small and self-compensating fluctuations of measures — now here, now there; now up, now down. The sampling variance discussed earlier in the chapter, for example, is random or error variance.⁶
It can be said that error variance is the variance in measures due to ignorance. Imagine
a great dictionary in which everything in the world — every occurrence, every event,
"I. Sarnoff. F. Lighthall. R. Waite. K. Davidson, and S. Sarason. "A Cross-Cultural Study of Anxiety
among American and English School Children," Journal of Educational Psychology. 49 (1958), 129-136.
⁶It will be necessary in this chapter and the next to use the notion of "random" or "randomness." Ideas of randomness and randomization will be discussed in considerable detail in Chapter 8. For the present, however, randomness means that there is no known way, expressible in language, of correctly describing or explaining events and their outcomes. Random events cannot be predicted, in other words. A random sample is a subset of a universe, its members so drawn that each member of the universe has an equal chance of being selected. This is another way of saying that, if members are randomly selected, there is no way to predict which member will be selected on any one selection — other things equal.
every little thing, every great thing — is given in complete detail. To understand any event that has occurred, that is now occurring, or that will occur, all one needs to do is look it up in the dictionary. With this dictionary there are obviously no random or chance occurrences. Everything is accounted for. In brief, there is no error variance; all is systematic variance. Unfortunately — or more likely, fortunately — we do not have such a dictionary. Many, many events and occurrences cannot be explained. Much variance eludes identification and control. This is error variance — at least as long as identification and control elude us.
While seemingly strange and even a bit bizarre, this mode of reasoning is useful, provided we remember that some of the error variance of today may not be the error variance of tomorrow. Suppose that we do an experiment on teaching problem-solving in which we assign pupils to three groups at random. After we finish the experiment, we study the differences between the three groups to see if the teaching has had an effect. We know that the scores and the means of the groups will always show minor fluctuations, now plus a point or two or three, now minus a point or two or three, which we can probably never control. Something or other makes the scores and the means fluctuate in this fashion. According to the view under discussion, they do not just fluctuate for no reason; there is probably no "absolute randomness." Assuming determinism, there must be some cause or causes for the fluctuations. True, we can learn some of them and possibly control them. When we do this, however, we have systematic variance.
We find out, for instance, that sex "causes" the scores to fluctuate, since boys and girls are mixed in the experimental groups. (We are, of course, talking figuratively here. Obviously sex does not make scores fluctuate.) So we do the experiment and control sex by using, say, only boys. The scores still fluctuate, though to a somewhat lesser extent. We remove another presumed cause of the perturbations: intelligence. The scores still fluctuate, though to a still lesser extent. We go on removing such sources of variance. We are controlling systematic variance. We are also gradually identifying and controlling more and more unknown variance.
Now note that before we controlled or removed these systematic variances, before we "knew" about them, we would have to label all such variance error variance — partly through ignorance, partly through inability to do anything about such variance. We could go on and on doing this — and there will still be variance left over. Finally we give in; we "know" no more; we have done all we can. There will still be variance. A practical definition of error variance, then, would be: Error variance is the variance left over in a set of measures after all known sources of systematic variance have been removed from the measures. This is so important it deserves a numerical example.
'"The idea for this variable was gotten from; J. Hiller. G. Fisher, and W. Kaess, "A Computer Investigation
of Verbal Characteristics of Effective Classroom Lecturing." American Educational Research Journal, 6
(1969). 661-675. These authors, however, mtajurfd vagueness; in the above example, vagueness is an experi-
mental or manipulated variable.
     A₁   A₂
      3    6
      5    5
      1    7
      4    8
      2    4
M:    3    6
The means are different; they vary. There is between-groups variance. Taking the difference between the means at face value — we will be more precise later — we may conclude that vagueness in lecturing had an effect. Calculating the between-groups variance just as we did earlier, we get:
This is the total variance, V_t. V_t = 4.25 contains all sources of variation in the scores. We already know that one of these is the between-groups variance, V_b = 2.25. Let us calculate still another variance. We do this by calculating the variance of A₁ alone and the variance of A₂ alone and then averaging the two:

A₁    x    x²        A₂    x    x²
 3    0    0          6    0    0
 5    2    4          5   -1    1
 1   -2    4          7    1    1
 4    1    1          8    2    4
 2   -1    1          4   -2    4
     Σx² = 10             Σx² = 10
V(A₁) = 10/5 = 2     V(A₂) = 10/5 = 2
V_w = (2 + 2)/2 = 2
. . . assuming, and hoping our assumption is correct, that they have been equally distributed between the two groups.
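Before removing anything by hand, the reader can verify the whole decomposition with the example's own scores (a Python sketch):

    A1 = [3, 5, 1, 4, 2]   # M = 3
    A2 = [6, 5, 7, 8, 4]   # M = 6

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    Vt = variance(A1 + A2)                     # total variance: 4.25
    Vb = variance([sum(A1) / 5, sum(A2) / 5])  # between-groups variance: 2.25
    Vw = (variance(A1) + variance(A2)) / 2     # within-groups variance: 2.00
    assert abs(Vt - (Vb + Vw)) < 1e-9          # V_t = V_b + V_w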
Let us demonstrate all this another way by removing from the original set of scores the between-groups variance, using a simple subtractive procedure. First, we let each of the means of A₁ and A₂ be equal to the total mean; we remove the between-groups variance. The total mean is 4.5. (See above, where the mean of all 10 scores was calculated.) Second, we adjust each individual score of A₁ and A₂ by subtracting or adding, as the case may be, an appropriate constant. Since the mean of A₁ is 3, we add 4.5 - 3 = 1.5 to each of the A₁ scores. The mean of A₂ is 6, and 6 - 4.5 = 1.5 is the constant to be subtracted from each of the A₂ scores.
Study the "corrected" scores. Compare them with the original scores. Note that they vary less than they did before. Naturally. We removed the between-groups variance, a sizable portion of the total variance. The variance that remains is that portion of the total variance due, presumably, to chance. We calculate the variance of the "corrected" scores of A₁, A₂, and the total, and note these surprising results:
Correction: A₁ scores + 1.5; A₂ scores - 1.5
A₁ (corrected): 4.5, 6.5, 2.5, 5.5, 3.5   M = 4.5   V = 2
A₂ (corrected): 4.5, 3.5, 5.5, 6.5, 2.5   M = 4.5   V = 2
Total "corrected" variance: V = 2 (= V_w)
COMPONENTS OF VARIANCE
The discussion so far may have convinced the student that any total variance has what will be called "components of variance." The case just considered, however, included one experimental component due to the difference between A₁ and A₂, one component due to individual differences, and a third component due to random error. We now study the case of two components of systematic experimental variance. To do this, we synthesize the experimental measures, creating them from known variance components. We go backwards, in other words. Because we start from "known" sources of variance, from "known" scores, there will be no error in the synthesized scores.
We have a variable X which has three values. Let X = {0, 1, 2}. We also have another variable, Y, which has three values. Let Y = {0, 2, 4}. X and Y, then, are known sources of variance. We assume an ideal experimental situation where there are two independent variables acting in concert to produce effects on a dependent variable, Z. That is, each score of X operates with each score of Y to produce a dependent variable score Z. For example, the X score, 0, has no influence. The X score, 1, operates with Y as follows: {(1 + 0), (1 + 2), (1 + 4)}. Similarly, the X score, 2, operates with Y: {(2 + 0), (2 + 2), (2 + 4)}. All this is easier to see if we generate Z in clear view:
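A short program generates Z exactly as described, each X score operating with each Y score, and shows that the variance of the synthesized Z is the sum of the two component variances:

    X = [0, 1, 2]
    Y = [0, 2, 4]
    Z = [x + y for x in X for y in Y]   # the nine synthesized scores

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((v - m) ** 2 for v in xs) / len(xs)

    print(variance(X), variance(Y), variance(Z))   # about 0.67, 2.67, 3.33
    assert abs(variance(Z) - (variance(X) + variance(Y))) < 1e-9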
We now set ourselves a problem. (Note carefully in what follows that we are going to work with deviations from the mean, x's and y's, and not with the original raw scores.) We have calculated the variances of X and Y above by using the x's and y's, that is, the deviations from the respective means of X and Y. If we can calculate the variance of any set of scores, is it not possible to calculate the relation between any two sets of scores in a similar way? Is it conceivable that we can calculate the variance of the two sets simultaneously? And if we do so, will this be a measure of the variance of the two sets together? Will this variance also be a measure of the relation between the two sets?
What we want to do is to use some statistical operation analogous to the set operation of intersection, X ∩ Y. To calculate the variance of X or of Y, we squared the deviations from the mean, the x's or the y's, and then added and averaged them. A natural answer to our problem is to perform an analogous operation on the x's and y's together. To calculate the variance of X, we did this first: (x₁ · x₁), . . . , (x₄ · x₄) = x₁², . . . , x₄². Why, then, not follow this through with both x's and y's, multiplying the ordered pairs like this: (x₁ · y₁), . . . , (x₄ · y₄)? Then, instead of writing Σx² or Σy², we write Σxy, as follows:
Σxy = 4.00
Cov_xy = Σxy / n = 4.00 / 4 = 1.00
scores. The variation is aptly called covariance and is a measure of the relation between the sets of scores.
It can be seen that the definition of relation as a set of ordered pairs leads to several ways to define the relation of the above example:
R_xy = Cov_xy / √(V_x · V_y) = 1.00 / 1.25 = .80
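The computation can be retraced in code. The deviation scores below are an assumption, reconstructed to agree with the figures in the text (V_x = V_y = 1.25, Cov_xy = 1.00); the exact ordered pairs are not quoted from the original:

    # Reconstructed deviation scores (an assumption consistent with the text).
    x = [-1.5, -0.5, 0.5, 1.5]
    y = [-1.5, -0.5, 1.5, 0.5]
    n = len(x)

    Vx = sum(d * d for d in x) / n                 # 1.25
    Vy = sum(d * d for d in y) / n                 # 1.25
    Cov = sum(a * b for a, b in zip(x, y)) / n     # Σxy / n = 1.00

    r = Cov / (Vx * Vy) ** 0.5                     # 1.00 / 1.25 = .80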
Variance and covariance are concepts of the highest importance in research and in the analysis of research data. There are two main reasons. One, they summarize, so to speak, the variability of variables and the relations among variables. This is most easily seen when we realize that correlations are covariances. But the term also means the covarying of variables in general. In much or most of our research we literally pursue and study covariation of phenomena. Two, variance and covariance form the statistical backbone of multivariate analysis, as we will see toward the end of the book. Most discussions of the analysis of data are based on variances and covariances. Analysis of variance, for example, studies different sources of variance of observations, mostly in experiments, as indicated earlier. Factor analysis is in effect the study of covariances, one of whose major purposes is to isolate and identify common sources of variation. The contemporary ultimate in analysis, the most powerful and advanced multivariate approach yet devised, is called analysis of covariance structures because the system studies complex sets of relations by analyzing the covariances among variables. Variances and covariances will obviously be the core of much of our discussion and preoccupation from this point on.
Study Suggestions
1. A social psychologist has done an experiment in which one group, A₁, was given a task to do in the presence of an audience, and another group, A₂, was given the same task to do without an audience. The scores of the two groups on the task, a measure of digital skill, were:

A₁   A₂
 5    3
 5    4
 9    7
 8    4
 3    2

(a) Calculate the means and variances of A₁ and A₂, using the method described in the text.
(b) Calculate the between-groups variance, V_b, and the within-groups variance, V_w.
(c) Arrange all ten scores in a column, and calculate the total variance, V_t.
(d) Substitute the calculated values obtained in (b) and (c), above, in the equation: V_t = V_b + V_w. Interpret the results.
[Answers: (a) V(A₁) = 4.8, V(A₂) = 2.8; (b) V_b = 1.0, V_w = 3.8; (c) V_t = 4.8.]
2. Add 2 to each of the scores of A₁ in 1, above, and calculate V_t, V_b, and V_w. Which of these
variances changed? Which stayed the same? Why?
[Answers: V_t = 7.8; V_b = 4.0; V_w = 3.8.]
3. Equalize the means of A₁ and A₂ in 1, above, by adding a constant of 2 to each of the scores
of A₂. Calculate V_t, V_b, and V_w. What is the main difference between these results and those of 1,
above? Why?
4. Suppose a sociological researcher obtained measures of conservatism (A), attitude toward
religion (B), and anti-Semitism (C) from 100 individuals. The correlations between the variables
were: r_AB = .70; r_AC = .40; r_BC = .30. What do these correlations mean? [Hint: Square the r's before
trying to interpret the relations. Also, think of ordered pairs.]
5. The purpose of this study suggestion and Study Suggestion 6 is to give the student an
intuitive feeling for the variability of sample statistics, the relation between population and sample
variances, and between-groups and error variances. Appendix C contains 40 sets of 100 random
numbers, 1 through 100, with calculated means, variances, and standard deviations. Draw 10 sets of
10 numbers each from 10 different places in the table.
a. Calculate the mean, variance, and standard deviation of each of the 10 sets. Find the
highest and lowest means and the highest and lowest variances. Do they differ much
from each other? What value "should" the means be? (50) While doing this, save the 10
totals and calculate the mean of all 100 numbers. Do the 10 means differ much from the
total mean? Do they differ much from the means reported in the table of means, variances,
and standard deviations?
d. Discuss the meaning of your results after reviewing the discussion in the text.
6. As early as possible in their study, students of research should start to understand and use the
computer. Study Suggestion 5 can be better and less laboriously accomplished with the computer. It
would be better, for example, to draw 20 samples of 100 numbers each. Why? In any case, students
should learn how to do simple statistical operations using existing computer facilities and programs
at their institutions. All institutions have programs for calculating means and standard deviations
(variances can be obtained by squaring the standard deviations¹²) and for generating random numbers.
If you can use your institution's facilities, use them for Study Suggestion 5, but increase the
number of samples and their n's. You can have much more fun by writing your own program to
calculate simple statistics. For guidance — and even programs — see Lohnes and Cooley's very
useful elementary statistics book.¹³
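In the spirit of Suggestion 6, here is a minimal Python sketch for checking the answers to Study Suggestion 1. Note that it divides by n, as in the text, and not by n − 1; see the footnote below on the discrepancy between the two formulas.

# Checking Study Suggestion 1: the total variance decomposes into
# between-groups and within-groups variance, V_t = V_b + V_w.

def variance(scores):                    # population formula: divide by n
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores) / len(scores)

a1 = [5, 5, 9, 8, 3]    # group A1
a2 = [3, 4, 7, 4, 2]    # group A2

m1 = sum(a1) / len(a1)                           # 6.0
m2 = sum(a2) / len(a2)                           # 4.0
v_w = (variance(a1) + variance(a2)) / 2          # within: (4.8 + 2.8) / 2 = 3.8
v_b = variance([m1, m2])                         # between: variance of the means = 1.0
v_t = variance(a1 + a2)                          # total, all ten scores = 4.8

print(v_b, v_w, v_t)    # 1.0 3.8 4.8; note that v_b + v_w = v_t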
¹²There may be small discrepancies between your hand-calculated standard deviations and variances and
those of the computer because existing programs and built-in routines of hand-held calculators usually use a
formula with N − 1 rather than N in the denominator of the formula. The discrepancies will be small, however,
especially if N is large. (The reason for the different formulas will be explained later when we take up sampling
and other matters.)
¹³P. Lohnes and W. Cooley, Introduction to Statistical Procedures: With Computer Exercises. New York:
Wiley, 1968.
PART THREE
PROBABILITY, RANDOMNESS,
AND SAMPLING
Chapter 7
Probability

Probability is a subject we know a great deal about, and a subject we know nothing about. Kinder-
gartners can study probability, and philosophers do. It is dull; it is interesting. Such
contradictions are the stuff of probability.
Take the expression "laws of chance." The expression itself is contradictory. Chance
or randomness, by definition, is the absence of law. If events can be explained lawfully,
they are not random. Then why say "laws of chance"? The answer, too, is seemingly
contradictory. It is possible to gain knowledge from ignorance, if we view randomness as
ignorance. This is because random events, in the aggregate, occur in lawful ways with
monotonous regularity. From the disorder of randomness the scientist welds the order of
scientific prediction and control.
It is not easy to explain these disconcerting statements. Indeed, philosophers disagree
on the answers. Fortunately there is no disagreement on the empirical probabilistic
events — or at least very little. Almost all scientists and philosophers will agree that if two
dice are thrown a number of times, there will probably be more sevens than twos or
twelves. They will also agree that certain events like finding a hundred-dollar bill or
winning a sweepstakes are extremely unlikely.
DEFINITION OF PROBABILITY
What is probability? We ask this question and immediately strike a perplexing problem.
Philosophers cannot seem to agree on the answer.¹ This seems to be because there are two
broad definitions, among others, which seem irreconcilable: the a priori and the a posteriori.
The a priori definition we owe to a controversial, interesting, and very human
genius, Simon Laplace.² The probability of an event is the number of favorable cases
divided by the total number of (equally possible) cases, or p = f/(f + u), where p is the
probability, f the number of favorable cases, and u the number of unfavorable cases. The
method of calculating probability implied by the definition is a priori in the sense that
probability is given, that we can determine the probabilities of events before empirical
investigation. This definition is the basis of theoretical mathematical probability.
The a posteriori, or frequency, definition is empirical in nature. It says that, in an
actual series of tests, probability is the ratio of the number of times an event occurs to the
total number of trials. With this definition, one approaches probability empirically by
performing a series of tests, counting the number of times a certain kind of event happens,
and then calculating the ratio. The result of the calculation is the probability of the certain
kind of event. Frequency definitions have to be used when theoretical enumeration over
classes of events is not possible. For example, to calculate longevity and horse race
probabilities one has to use actuarial tables and calculate probabilities from past counts
and calculations.³
Practically speaking and for our purposes, the distinction between the a priori and a
posteriori definition is not too vital. Following Margenau (p. 264), we put the two to-
gether by saying that the a priori approach supplies a constitutive definition of probability,
whereas the a posteriori approach supplies an operational definition of probability. We
need to use both approaches; we need to supplement one with the other.
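The two definitions can be put side by side in a few lines of code. The sketch below, a hypothetical illustration in Python, states the a priori probability of throwing a six and then approximates the a posteriori probability by the frequency ratio over many simulated trials; the two values converge.

# A priori versus a posteriori (frequency) probability for one die face.
import random

p_apriori = 1 / 6                       # favorable / (favorable + unfavorable)

trials = 100_000
hits = sum(1 for _ in range(trials) if random.randint(1, 6) == 6)
p_frequency = hits / trials             # ratio of occurrences to trials

print(p_apriori)                        # 0.1666...
print(p_frequency)                      # close to 1/6, e.g. 0.167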
¹For discussions of the disagreement, see J. Kemeny, A Philosopher Looks at Science. New York: Van
Nostrand Reinhold, 1959, chaps. 4, 11; H. Margenau, The Nature of Physical Reality. New York: McGraw-Hill,
1950, chap. 13; W. Salmon, The Foundations of Scientific Inference. Pittsburgh: University of Pittsburgh Press,
1966, chap. V. (Salmon presents five interpretations of probability!)
²For a good brief discussion of Laplace and his work, see J. Newman, The World of Mathematics, vol. 2.
New York: Simon and Schuster, 1956, pp. 1316-1324. For Laplace's own definition of probability, see ibid.,
pp. 1325-1333. Discussions of the two kinds of definitions are given by Margenau, op. cit., chap. 13. (Laplace
was famous for using that exasperating expression of mathematicians and statisticians, "It is easy to see
that . . . .")
³M. Turner, Philosophy and the Science of Behavior. New York: Appleton-Century-Crofts, 1967, p. 384.
[Figure 7.1: the generation of the Cartesian product A₁ × A₂, with lines connecting each element of A₁ to each element of A₂]
listing all the members of the set, and by giving a rule for the inclusion of elements in a
set. In probability theory, both kinds of definition are used. What is U in tossing two
coins? We list all the possibilities: U = {(H, H), (H, T), (T, H), (T, T)}. This is a list
definition of U. A rule definition — although we would not use it — might be: U = {x: x is
all combinations of H and T}. In this case U is the Cartesian product. Let A₁ = {H₁, T₁},
the first coin; let A₂ = {H₂, T₂}, the second coin. Recalling that a Cartesian product of two
sets is the set of all ordered pairs whose first entry is an element of one set and whose
second entry is an element of another set, we can diagram the generation of the Cartesian
product of this case, A₁ × A₂, as in Figure 7.1. Notice that there are four lines connecting
A₁ and A₂. Thus there are four possibilities: {(H₁, H₂), (H₁, T₂), (T₁, H₂), (T₁, T₂)}. This
thinking and procedure can be used in defining many sample spaces, or U's, although the
actual procedure can be tedious.
With two dice, what is U? Think of the Cartesian product of two sets and you will
probably have little trouble. Let A₁ be the outcomes or points of the first die:
{1, 2, 3, 4, 5, 6}. Let A₂ be the outcomes or points of the second die. Then U = A₁ ×
A₂ = {(1, 1), (1, 2), . . . , (6, 5), (6, 6)}. We can diagram this as we diagrammed the coin
example, but counting the lines is more difficult; there are too many of them. We can
know the number of possible outcomes simply by 6 × 6 = 36, or in a formula: mn, where
m is the number of possible outcomes of the first set, and n is the number of possible
outcomes of the second set.
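Both sample spaces can be generated mechanically as Cartesian products. A brief Python sketch (the standard itertools.product function does the pairing):

# Sample spaces as Cartesian products, as in the coin and dice examples;
# len(U) illustrates the m * n counting rule.
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

U_coins = list(product(coin, coin))     # (H,H), (H,T), (T,H), (T,T)
U_dice = list(product(die, die))        # (1,1), (1,2), ..., (6,6)

print(len(U_coins))                     # 4 = 2 * 2
print(len(U_dice))                      # 36 = 6 * 6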
It is often possible to solve difficult probability problems by using trees. Trees define
sample spaces, logical possibilities, with clarity and precision. A tree is a diagram that
gives all possible alternatives or outcomes for combinations of sets by providing paths and
set points. This definition is a bit unwieldy. Illustration is better. Take the coin example
(we turn the tree on its side). Its tree is shown in Figure 7.2.
To determine the number of possible alternatives, just count the number of alternatives
or points at the "top" of the tree. In this case, there are four alternatives. To name the
alternatives, read off, for each end point, the points that led to it. For example, the first
alternative is (H₁, H₂). Obviously, three, four, or more coins can be used. The only
trouble is that the procedure is tedious because of the large number of alternatives. The
tree for three coins is illustrated in Figure 7.3. There are eight possible alternatives,
outcomes, or sample points: U = {(H₁, H₂, H₃), (H₁, H₂, T₃), . . . , (T₁, T₂, T₃)}.
[Figure 7.2: tree diagram for two coin tosses; from the start, each toss branches into H and T, giving four end points]
Sample points of a sample space may seem a bit confusing to the reader, because two
kinds of points have been discussed without differentiation. Another term and its use may
help clear up this possible confusion. An event is a subset of U. Any element of a set is
also a subset of the set. Recall that with set A = {a₁, a₂}, for example, both {a₁} and {a₂}
are subsets of A, as well as {a₁, a₂}, and { }, the empty set. Identically, all the outcomes of
Figures 7.2 and 7.3, for example, (H₁, T₂), (T₁, H₂), and (T₁, H₂, T₃), are subsets of their
respective U's. Therefore they are events, too — by definition. But in the usual usage,
events are more encompassing than points. All points are events (subsets), but not all
events are points. Or, a point or outcome is a special kind of event, the simplest kind. Any
time we state a proposition, we describe an event. We ask, for instance, "If two coins are
thrown, what is the probability of getting two heads?" The "two heads" is an event. It
[Figure 7.3: tree diagram for three coin tosses, with eight end points; the three paths containing 2 heads and 1 tail are checked]
so happens, in this case, that it is also a sample point. But suppose we asked, "What is
the probability of getting at least one head?" "At least one head" is an event, but not a
sample point, because it includes, in this case, three sample points: (H₁, H₂), (H₁, T₂),
and (T₁, H₂). (See Figure 7.2.)
case of 3 heads, one case of 3 tails, three cases of 2 heads and 1 tail, and three cases of
2 tails and 1 head. The probability of each of the eight outcomes is obviously 1/8. Thus
the probability of 3 heads is 1/8, and the probability of 3 tails is 1/8. The probability of
2 heads and 1 tail, on the other hand, is 3/8, and similarly for the probability of 2 tails and
1 head.
The probabilities of all the points in the sample space must add up to 1.00. It also
follows that probabilities are always positive. If we write a probability tree for the three-
toss experiment, it looks like Figure 7.3. Each complete path of the tree (from the start to
the third toss) is a sample point. All the paths comprise the sample space. The single path
sections are labeled with the probabilities of the separate outcomes, in this case 1/2 each;
each complete path then carries the product of its sections, 1/8.
This leads naturally to the statement of a basic principle: If the outcomes at the different
points in the tree, that is, at the first, second, and third tosses, are independent of each
other (that is, if one outcome does not influence another in any way), then the probability
of any sample point (HHH perhaps) is the product of the probabilities of the separate
outcomes. For example, the probability of 3 heads is 1/2 x 1/2 x 1/2 = 1/8.
Another principle is: To obtain the probability of any event, add the probabilities of
the sample points that comprise that event. For example, what is the probability of tossing
2 heads and 1 tail? We look at the paths in the tree that have 2 heads and 1 tail. There are
3 of them. (They are checked in Figure 7.3.) Thus, 1/8 + 1/8 + 1/8 = 3/8. In set lan-
guage, we find the subsets (events) of U and note their probabilities. The subsets of U of
the type "2 heads and 1 tail" are, from the tree or the previous definition of U,
{(H, H, T), (H, T, H), (T, H, H)}. Call this the set A₁. Then p(A₁) = 3/8.
This procedure can be followed with an experiment of 100 tosses, but it is much too
laborious. Instead, to get the theoretical expectations, we merely multiply the number of
tosses by the probability of any one of them, 100 x 1/2 = 50, to get the expected number
of heads (or tails). This can be done because all the probabilities are the same. A big and
important question is: In actual experiments in which we throw 100 coins, will we get
exactly 50 heads? No, not often: about 8 times in 100 such experiments. This can be
written: p = 8/100 or .08. (Probabilities can be written in fractional or decimal forms,
more usually in decimal form.)
There is only one way each for a 2 or a 12 to turn up, 1 + 1 and 6 + 6, but there are three ways for a 4 to turn up: 1 + 3, 3 + 1, and 2 + 2. If
this is so, then the probabilities for getting different sums must be different. The game of
craps is based on these differences in frequency expectations.
To solve the a priori probability problem, we must first define the sample space:
U = {(1, 1), (1, 2), (1, 3), . . . , (6, 4), (6, 5), (6, 6)}. That is, we pair each number of
the first die with each number of the second die in turn (the Cartesian product again). This
can easily be seen if we set up this procedure in a matrix (see Table 7.1). Suppose we want
to know the probability of the event — a very important event, too — "a 7 turns up."
Simply count the number of 7's in the table. There are six of them nicely arrayed along the
center diagonal. There are 36 sample points in U, obtained by some method of enumerating
them, as above, or by using the formula mn, which says: Multiply the number of
possibilities of the first thing by the number of possibilities of the second thing. This
method can be defined: When there are m ways of doing something, A, and n ways of
doing something else, B, then, if the n ways of doing B are independent of the m ways of
doing A, there are m · n ways of doing both A and B.⁴
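A short Python sketch that enumerates the 36 two-dice sample points and counts them by sum, reproducing the six 7's of the matrix:

# Counting the 36 two-dice sample points by sum, as in Table 7.1.
from itertools import product
from collections import Counter

counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

print(counts[7])                # 6 ways: (1,6), (2,5), ..., (6,1)
print(counts[7] / 36)           # 0.1666...
print(counts[2], counts[12])    # 1 way each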
Sum of Dice      2   3   4   5   6   7   8   9  10  11  12
f (36 throws)    1   2   3   4   5   6   5   4   3   2   1
f (72 throws)    2   4   6   8  10  12  10   8   6   4   2
f_o (72 throws)  4   2   6   6  10  15   7  11   6   4   1
This is a function notion; w is called a weight function. It is a rule that assigns weights
to elements of a set, U, in such a way that the sum of the weights is equal to 1, that is,
w₁ + w₂ + w₃ + · · · + wₙ = 1.00, and wᵢ = 1/n. The weights are equal, assuming equiprobability;
each weight is a fraction with 1 in the numerator and the number of cases, n,
in the denominator. In the previous experiment of the tosses of a coin (Figure 7.3), the
weights assigned to each element of U, U being all the outcomes, are all 1/8. The sum of
all the weights, w(x), is 1/8 + 1/8 + · · · + 1/8 = 1. In probability theory, the weights of
the elements of a set or subset are summed:

    Σ w(x)    or    Σ w(x)
   x in U          x in A

We write m(A), meaning "The measure of the set A." This simply means the sum of the
weights of the elements in the set A.
Suppose we randomly sample children from the 400 children of the fourth grade of a
school system. Then U is all 400 children. Each child is a sample point of U. Each child is
an x in U. The probability of selecting any one child at random is 1/400. Let A = the boys
in U, and B = the girls in U. There are 100 boys and 300 girls. Each boy is assigned the
⁴The approach used here follows to some extent that found in J. Kemeny, J. Snell, and G. Thompson,
Introduction to Finite Mathematics, 2d ed. Englewood Cliffs, N.J.: Prentice-Hall, 1966, chap. IV.
⁵Note that the sum of the weights in a subset A of U does not have to equal 1. In fact, it is usually less
than 1.
weight 1/400, and each girl is assigned the weight 1/400. Suppose we wish to sample, all
together, 100 children. Our expectation is, then, 25 boys and 75 girls in the sample. The
measure of the set A, m(A), is the sum of the weights of all the elements in A. Since there
are 100 boys in U, we sum the 100 weights: 1/400 + 1/400 + · · · + 1/400 = 100/400 = 1/4, or

    m(A) = Σ w(x) = 1/4
          x in A

Similarly,

    m(B) = Σ w(x) = 3/4
          x in B
For the set B, the girls, we sum 300 weights, each of them being 1/400. In short, the sums
of the weights are the probabilities. That is, the measure of a set is the probability of a
member of the set being chosen. Thus we can say that the probability that a child selected
at random from the 400 children will be a boy is 1/4, and the probability that the selected member
will be a girl is 3/4. To determine the expected frequencies, multiply the sample size by
these probabilities: 1/4 x 100 = 25 and 3/4 x 100 = 75.
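A sketch of the same weight-function reasoning in Python, using exact fractions so the measures come out as 1/4 and 3/4:

# The measure of a set as the sum of its elements' weights, with the
# 400-child example: 100 boys and 300 girls, each weighted 1/400.
from fractions import Fraction

children = ["boy"] * 100 + ["girl"] * 300
w = Fraction(1, len(children))                   # equal weight, 1/400

m_A = sum(w for c in children if c == "boy")     # measure of A, the boys
m_B = sum(w for c in children if c == "girl")    # measure of B, the girls

print(m_A, m_B)                 # 1/4 3/4
print(m_A * 100, m_B * 100)     # expected frequencies in a sample of 100: 25, 75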
Probability has three fundamental properties:
1. The measure of any set, as defined above, is greater than or equal to 0 and less than or equal
to 1. In brief, probabilities (measures of sets) are either 0, 1, or in between.
2. The measure of a set, m(A), equals 0 if and only if there are no members in A, that is, A
is empty.
3. Let A and B be sets. If A and B are disjoint, that is, A ∩ B = E, then:

    m(A ∪ B) = m(A) + m(B)

This equation says that when no members of A and B are shared in common, then the
probability of either A or B or both is equal to the combined probabilities of A and B.
There is no need to give an example to illustrate (1). We have had several earlier. To
illustrate (2), assume, in the boys-girls example, that we asked the probability of drawing
a teacher in the sample. But U did not include teachers. Let C be the set of fourth-grade
teachers. In this case, the set C is empty, and m(C) = 0. Use the same boys-girls example
to illustrate (3). Let A be the set of boys, B the set of girls. Then m(A ∪ B) = m(A) +
m(B). But m(A ∪ B) = 1.00, because they were the only subsets of U. And we learned
answers to such questions as: What is the probability of a female white Anglo-Saxon
Protestant being listed in Who's Who in America?⁷ Or what is the probability of a black
male holding a high rank in the Civil Service?⁸
Compound events are more interesting than single events — and more useful in re-
search. Relations can be studied with them. To understand this, we first define and
illustrate compound events and then examine certain counting problems and the ways in
which counting is related to set theory and probability theory. It will be found that if the
basic theory is understood, the application of probability theory to research problems is
Recall that earlier the frequency definition of probability was given as:

    p = f / (f + u)                                        (7.2)

where f is the number of favorable cases, and u the number of unfavorable cases. The
numerator is n(F) and the denominator n(U), the total number of possible cases. Similarly,
we can divide through the terms of Equation 7.1 by n(U):

    n(A ∪ B)/n(U) = n(A)/n(U) + n(B)/n(U)                  (7.3)

Using the example of the 100 children, and substituting values in Equation 7.3, we get

    100/100 = 60/100 + 40/100
In many cases, two (or more) sets in which we are interested are not disjoint. Rather,
they overlap. When this is so, then A ∩ B ≠ E, and it is not true that n(A ∪ B) = n(A) +
n(B). Look at Figure 7.4.
Here A and B are subsets of U; sample points are indicated by dots. The number of
sample points in A is 8; the number in B is 6. There are two sample points in A ∩ B. Thus
the equation above does not hold. If we calculate all the points in A ∪ B with Equation
⁷S. Lieberson and D. Carter, "Making It in America: Differences between Eminent Black and White
Ethnic Groups," American Sociological Review, 44 (1979), 347-366.
⁸K. Meier, "Representative Bureaucracy: An Empirical Analysis," American Political Science Review,
69 (1975), 526-542.
[Figure 7.4: two overlapping subsets A and B of U, sample points shown as dots; n(A) = 8, n(B) = 6, n(A ∩ B) = 2, n(U) = 24]
7.1, we get 8 + 6 = 14 points. But there are only 12 points. The equation has to be
altered to a more general equation that fits all cases:

    n(A ∪ B) = n(A) + n(B) − n(A ∩ B)                      (7.5)

It should be clear that the error when Equation 7.1 is used results from counting the
two points of A ∩ B twice. Therefore we subtract n(A ∩ B) once, which corrects the
equation. It now fits any possibility. If, for example, A ∩ B = E, the empty set,
Equation 7.5 reduces to (7.1). Equation 7.1 is a special case of (7.5). Calculating the
number of sample points in A ∪ B of Figure 7.4, then, we get: n(A ∪ B) = 8 + 6 − 2 =
12. If we divide Equation 7.5 through by n(U), as in (7.3):

    12/24 = 8/24 + 6/24 − 2/24
let B = {1, 2, 3, 4, 5, 6}. If we toss a coin and throw a die together, what are the possibilities?
Unless all the possibilities are exhausted, we cannot solve the probability problem.
There are 12 possibilities (2 × 6). The sets A and B exhaust the sample space. (This
is of course obvious, since A and B generated the sample space.) Now take a more realistic
which says that the probability of A and B both occurring is equal to the probability of A
times the probability of B. Easy and clear examples of independent events are dice throws
and coin tosses. If A is the event of a die throw and B is the event of a coin toss, and
p(A) = 1/6 and p(B) = 1/2, then, if p(A ∩ B) = p(A) · p(B) = 1/6 · 1/2 = 1/12, A and B are independent.
If we toss a coin 10 times, one toss has no influence on any other toss. The tosses are
independent. So are the throws of dice. Similarly, when we simultaneously throw a die
and toss a coin, the events of throwing a die, A, and tossing a coin, B, are independent.
The outcome of a die throw has no influence on the toss of a coin — and vice versa.
Unfortunately, this neat model does not always apply in research situations.
The commonsense notion of the so-called law of averages is utterly erroneous, but it
persists: the notion is that if there has been a long run of occurrences of an event, then the chance of that event occurring on the next trial
is smaller. Suppose a coin is being tossed. Heads has come up five times in a row. The
commonsense notion of the "law of averages" would lead one to believe that there is a
greater chance of getting tails on the next toss. Not so. The probability is still 1/2. Each
toss is an independent event.
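A simulation makes the point vividly. The sketch below, in Python, generates many runs of six tosses, keeps those that begin with five heads, and tallies the sixth toss; the proportion of heads stays near 1/2.

# The "law of averages" fallacy: after five heads in a row, the next toss
# is still heads about half the time, because tosses are independent.
import random

next_after_five_heads = []
for _ in range(200_000):
    tosses = [random.random() < 0.5 for _ in range(6)]    # True = heads
    if all(tosses[:5]):                  # keep only runs of five heads
        next_after_five_heads.append(tosses[5])

# proportion of heads on the sixth toss, given five heads; about .50
print(sum(next_after_five_heads) / len(next_after_five_heads))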
Suppose students in a college class are taking an examination. They are working under
the usual conditions of no communication, no looking at each other's papers, and so forth.
The responses of any student can be considered independent of the responses of any other
student. Can the responses to the items within the test be considered independent? Sup-
pose that the answer to one item later in the test is embedded in an item earlier in the test.
The probability of getting the later item correct by chance, say, is 1/4. But the fact that the
answer was given earlier can change this probability. With some students it might even
become 1.00. What is important for the researcher to know is that independence is often
difficult to achieve and that lack of independence when research operations assume inde-
pendence can seriously affect the interpretation of data.
Suppose we rank order examination papers and then assign grades on the basis of the
ranks. This is a perfectly legitimate and useful procedure. But it must be realized that the
grades given by the rank-order method are not independent (if they ever could be). Take
five such papers. After reading them one is ranked as the first (the best), the second next,
and so on through the five papers. We assign the number "1" to the first, "2" to the
second, "3" to the third, "4" to the fourth, and 5"to the fifth. After using up 1, we
have only 2,3,4, and 5 left. After using up 2, only 3, 4, and 5 remain. When we assign 4,
obviously we must assign 5 to the remaining examination. In short, the assignment of 5
—
was influenced by the assignment of 4 and also 1.2, and 3. The assignment events are
not independent. One may ask, "Does this matter?" Suppose we take the ranks, treat
them as scores, and make inferences about mean differences between groups, say between
two classes. The statistical test used to do this is probably based on the coin-dice paradigm
with its pristine independence. But we have not followed this model —
one of its most
important assumptions, independence, has been ignored.
When research events lack independence, statistical tests lack a certain validity. A χ²
test, for example, assumes that the events — responses of individuals to an interview
question, say — recorded in the cells of a crossbreak table, are independent of each other.
If the recorded events are not independent of each other, then the basis of the statistical
test and the inferences drawn from it are corrupted.
and scratches were inflicted by females! Then Hebb and Thompson pursued the interesting,
if disconcerting, idea of tabulating incidence of aggressive acts in two ways: when
such were preceded by quasi-aggression, that is, by warning of attack, and when aggressive
acts were preceded by friendly behavior. The resulting incidences of behavior are
given in Table 7.3. The table seems to indicate: "Watch out for female apes when they are
friendly!"
⁹D. Hebb and W. Thompson, "The Social Significance of Animal Studies." In G. Lindzey and E. Aronson,
eds., The Handbook of Social Psychology, 2d ed., vol. 2. New York: Random House, Inc., 1968,
pp. 729-774. The table is on p. 751.
ity of both events occurring by chance. If it is found that dice repeatedly show 12's, say,
then there is probably something wrong with the dice. If a gambler notes that another
gambler seems always to win, he will of course get suspicious. The chances of continually
winning a fair game are small. It can happen, of course, but it is unlikely to happen. In
research, it is unlikely that one would get two or three significant results by chance.
Something beyond chance is probably operating —
the independent variable, we hope.
Two, the formula can be turned around, so to speak. It can tell the researcher what he
must do to allow himself the advantage of the multiplicative probabilities. He must, if it
is at all possible, plan his research so that events are independent. That this is easier said
than done will become quite evident before this book is finished.
CONDITIONAL PROBABILITY
In all research and perhaps especially in social scientific and educational research, events
are often not independent. Look at independence in another way. When two variables are
related they are not independent. Our previous discussion of sets makes this clear: if A ∩
B = E, then there is no relation (more accurately, a zero relation), or A and B are independent;
if A ∩ B ≠ E, then there is a relation, or A and B are not independent. When events
are not independent, scientists can sharpen their probabilistic inferences. The meaning of
this statement can be explicated to some extent by studying conditional probability.
When events are not independent, the probability approach must be altered. Here is a
simple example. What is the probability that, of any married couple picked at random,
both mates are Republicans? First, assuming equiprobability and that everything else is
equal, the sample space U (all the possibilities) is {RR, RD, DR, DD}, where the husband
comes first in each possibility or sample point. Thus the probability that both husband and
wife are Republicans is p(RR) = 1/4. But suppose we know that one of them is a Republican.
What is the probability of both being Republicans now? U is reduced to {RR, RD,
DR}. The knowledge that one is a Republican deletes the possibility DD, thus reducing the
sample space. Therefore, p(RR) = 1/3. Suppose we have the further information that the
wife is a Republican. Now, what is the probability that both mates are Republicans? Now
U = {RR, DR}. Thus p(RR) = 1/2. The new probabilities are, in this case, "conditional"
on prior knowledge or facts.
Let A and B be events in the sample space, U, as usual. The conditional probability is
denoted p(A | B), which is read, "The probability of A, given B." For example, we
might say, "The probability that a husband and wife are both Republicans, given that the
husband is a Republican," or, much more difficult to answer, though more interesting,
"The probability of high effectiveness in college teaching, given the Ph.D. degree." Of
course, we can write p(B | A), too. The formula for the conditional probability involving
two events is:¹⁰

    p(A | B) = p(A ∩ B) / p(B)                             (7.8)

¹⁰The theory extends to more than two events, but will not be discussed in this book.
the sample space. The sample space has, through knowledge, been cut down from U to B.
To demonstrate this point take two examples, one of independence or simple probability
and one of dependence or conditional probability.
Toss a coin twice. The events are independent. What is the probability of getting heads
on the second toss if heads appeared on the first toss? We already know: 1/2. Let us
calculate the probability using Equation 7.8. First we write a probability matrix (see Table
7.4). For the probabilities of heads (H) and tails (T) on the first toss, read the marginal
entries on the right side of the matrix. Similarly for the probabilities of the second toss:
they are on the bottom of the matrix. Thus p(H₁) = 1/2, p(H₂) = 1/2, and p(H₂ ∩
H₁) = 1/4. Therefore,

    p(H₂ | H₁) = p(H₂ ∩ H₁) / p(H₁) = (1/4) / (1/2) = 1/2
An Academic Example
There are more interesting examples of conditional probability than coins and other such
chance devices. Take the baffling and frustrating problem of predicting the success of
doctoral students in graduate school. Can the coin-dice models be used in such a complex
situation? Yes — under certain conditions. Unfortunately, these conditions are difficult to
arrange. There is some limited success, however. Provided that we have certain empirical
information, the model can be quite useful. Assume that the administrators of a graduate
school are interested in predicting the success of their doctoral students. They are dis-
tressed by the poor performance of many of their graduates and want to set up a selection
system. The school continues to admit all doctoral applicants as in the past, but for three
years all incoming students take the Miller Analogies Test (MAT), a test that has been
found to be fairly successful in predicting doctoral success. An arbitrary cutoff point of a
raw score of 65 is selected.
The school administration finds that 30 percent of all the candidates of the three-year
period score 65 or above. Each is categorized as a success (s) or failure (f). The criterion
is simple: Does he or she get the degree? If so, this is defined as success. It is found that
40 percent of the total number succeed. To determine the relation between MAT score and
success or failure, the administration, again using a cutoff point of 65, determines the
proportions shown in Table 7.6.
TABLE 7.6

            Success (s)   Failure (f)   Total
  ≥ 65         .20           .10         .30
  < 65         .20           .50         .70
  Total        .40           .60        1.00
Maybe the following mode of looking at the problem will help. An area interpretation
of the graduate-student problem is diagrammed in Figure 7.5. The idea of a measure of a
set is used here. Recall that a measure of a set or subset is the sum of the weights of the
set or subset. The weights are assigned to the elements of the set or subset. Figure 7.5 is
a square with ten equal parts on each side. Each part is equal to 1/10 or .10. The area of
the whole square is the sample space U, and the measure of U, m(U), equals 1.00. This
means that all the weights assigned to all the elements of the square add up to 1.00. The
measures of the subsets have been inserted: m(F) = .60, m(<65) = .70, m(S ∩ ≥
65) = .20. The measures of these subsets can be calculated by multiplying the lengths of
their sides. For example, the area of the upper left (doubly hatched) box is .5 × .4 = .20.
Recall that the probability of any set (or subset) is the measure of the set (or subset). So
the probability of any of the boxes in Figure 7.5 is as indicated. We can find the probability
of any two boxes by adding the measures of sets; for example, the probability of
The areas of the "success" and "failure" measures are indicated by the heavy lines
separating them on the square.
Our conditional probability problem is: What is the probability of success, given
knowledge of MAT scores, or given ≥65 (it could also be <65, of course)? We have a
new small sample space, indicated by the whole shaded area at the top of the square. In
effect U has been cut down to this smaller space because we know the "truth" of the
smaller space. Instead of letting this smaller space be equal to .30, we now let it be equal
to 1.00. (You might say it becomes a new U.) Consequently the measures of the boxes
that constitute the new sample space must be recalculated. For instance, instead of calculating
the probability of p(≥65 ∩ S) = .20 because it is 2/10 of the area of the whole
[Figure 7.5: area diagram of the sample space, a unit square divided into Success and Failure columns and ≥65 and <65 rows, with box measures .20, .10, .20, and .50]
square, we must calculate, since we now know that the elements in the set ≥65 do have
MAT scores greater than or equal to 65, the probability on the basis of the area of ≥65
(the whole shaded area at the top of the square). Having done this, we get .20/.30 = .67,
which is exactly what we got when we used Equation 7.8.
What happens is that additional knowledge makes U no longer relevant as the sample
space. All probability statements are relative to sample spaces. The basic question, then,
is that of adequately defining sample spaces. In the earlier problem of husbands and
wives, we asked the question: What is the probability of both mates being Republican?
The sample space was U = {RR. RD, DR. DD}. But when we add the knowledge that one
of them is certainly a Republican and ask the same question, in effect we make the
original U irrelevant to the problem. A new sample space, call it U′, is required. Conse-
quently the probability that both are Republicans is different when we have more knowl-
edge.
We can calculate other probabilities similarly. Suppose we wanted to know the probability
of failure, given an MAT score less than 65. Look at Figure 7.5. The probability we
want is the box on the lower right, labeled .50. Since we know that the score is <65, we
use this knowledge to set up a new sample space. The two lower boxes whose area equals
.20 + .50 = .70 represent this sample space. Thus we calculate the new probability:
.50/.70 = .71. The probability of failure to get the degree if one has an MAT score less
than 65 is .71.
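Both conditional probabilities can be computed directly from the joint proportions (as reconstructed in Table 7.6 above) with Equation 7.8. A minimal Python sketch:

# Conditional probabilities from the joint proportions of Table 7.6:
# p(A | B) = p(A and B) / p(B).
joint = {
    (">=65", "s"): 0.20, (">=65", "f"): 0.10,
    ("<65", "s"): 0.20,  ("<65", "f"): 0.50,
}

def p(mat_level, outcome=None):
    if outcome is None:                  # marginal over both outcomes
        return sum(v for (m, o), v in joint.items() if m == mat_level)
    return joint[(mat_level, outcome)]

p_success_given_high = p(">=65", "s") / p(">=65")    # .20 / .30
p_failure_given_low = p("<65", "f") / p("<65")       # .50 / .70

print(round(p_success_given_high, 2))    # 0.67
print(round(p_failure_given_low, 2))     # 0.71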
Study Suggestions
1. Suppose that you are sampling ninth-grade youngsters for research purposes. There are 250
ninth graders in the school system, 130 boys and 120 girls.
2. Toss a coin and throw a die once. What is the probability of getting heads on the coin and a
six on the die? Draw a tree to show all the possibilities. Label the branches of the tree with the
appropriate weights or probabilities. Now answer some questions. What is the probability of get-
ting:
3. Toss a coin and roll a die 72 times. Write the results side by side on a ruled sheet as they
Figure 7.6
selected at random, what are the probabilities that the element will be in
(a) A?    (d) A ∪ B?
(b) B?    (e) U?
(c) A ∩ B?
[Answers: (a) 2/5; (b) 1/5; (c) 1/5; (d) 2/5; (e) 1.]
(a) Given A (knowing that a sampled element came from A), what is the probability of B?
(b) Given B, what is the probability of A?
[Answers: (a) 1/2; (b) 1.]
8. Suppose one had a two-item four-choice multiple-choice test, with the four choices of each
item labeled a. b, c, and d. The correct answers to the two items are c and a.
(a) Write out the sample space. (Draw a tree; see Figure 7.3.)
(b) What is the probability of any testee getting both items correct by guessing?
(c) What is the probability of getting at least one of the items correct by guessing? (Hint:
This may be a bit troublesome. Draw the tree and think of the possibilities. Count them.)
(d) What is the probability of getting both items wrong by guessing?
(e) Given that a testee gets the first item correct, what is the probability of his getting the
second item correct by guessing?
[Answers: (b) 1/16; (c) 7/16; (d) 9/16; (e) 1/4.]
9. Most of the discussion in the text has been based on the assumption of equiprobability. This
assumption is often not justified, however. What is wrong with the following argument, for in-
stance? The probability of one's dying tomorrow is one-half. Why? Because one will either die
tomorrow or not die tomorrow. Since there are two possibilities, they each have a probability of
occurrence of one-half. How would insurance companies fare with this reasoning? Suppose, now,
that a political scientist studied the relation between religious and political preferences, and assumed
that the probabilities that a Catholic was Democrat or Republican were equal. What would you think
of his research results? Do these examples have implications for researchers knowing something of
the phenomena they are studying? Explain.
Chapter 8
Sampling and
Randomness
Imagine the many situations in which we want to know something about people, about
events, about things. To learn something about people, for instance, we take some few
people whom we know — or do not know — and study them. After our "study," we come
to certain conclusions, often about people in general. Some such method is behind much
folk wisdom. Commonsensical observations about people, their motives, and their behav-
iors derive, for the most part, from observations and experiences with relatively few
people. We make such statements as: "People nowadays have no sense of moral values";
"Politicians are corrupt"; and "Public school pupils are not learning the three R"s."
The basis for making such statements is simple. People, mostly through their limited
experiences, come to certain conclusions about other people and about their environment.
In order to come to such conclusions, they must sample their "experiences" of other
people. Actually, they take relatively small samples of all possible experiences. The term
"experiences" here has to be taken in a broad sense. It can mean direct experience with
other people —
for example, first-hand interaction with, say, Germans or Jews. Or it can
mean indirect experience: hearing about Germans or Jews from friends, acquaintances,
parents, and others. Whether experience is direct or indirect, however, does not concern
us too much at this point. Let us assume that all such experience is direct. An individual
believes he "knows" something about Jews and says he "knows" they are clannish,
because he has had direct experience with a number of Jews. He may even say, "Some of
my best friends are Jews, and I know that ..." The point is that his conclusions are
based on a sample of Jews, or a sample of the behaviors of Jews, or both. He can never
"know" all Jews; he must depend, in the last analysis, on samples. Indeed, most of the
world's knowledge is based on samples, most often on inadequate samples.
¹W. Feller, An Introduction to Probability Theory and Its Applications, 2nd ed. New York: Wiley, 1957,
p. 29.
two dice a number of times, the probability of a 7 turning up is greater than that of a 12
turning up. (See Table 7.1.)
A sample drawn at random is unbiased in the sense that no member of the population
has any more chance of being selected than any other member. We have here a democracy
in which all members are equal before the bar of selection. Rather than using coins or
dice, let's use a research example. Suppose we have a population of 100 children. The
children differ in intelligence, a variable relevant to our research. We want to know the
mean intelligence score of the population, but for some reason we can only sample 30 of
the 100 children. If we sample randomly, there are a large number of possible samples of
30 each. The samples have equal probabilities of being selected. The means of most of the
samples will be relatively close to the mean of the population. A few will not be close.
The probability of selecting a sample with a mean close to the population mean, then, is
greater than the probability of selecting a sample with a mean not close to the population
mean — if the sampling has been random.
If we do not draw our sample at random, however, some factor or factors unknown to
us may predispose us to select a biased sample, in this case perhaps one of the samples
with a mean not close to the population mean. The mean intelligence of this sample will
then be a biased estimate of the population mean. If the 100 children were known to us,
we might unconsciously tend to select the more intelligent children. It is not so much that
we would do so; it is that our method allows us to do so. Random methods of selection do
not allow our own biases or any other systematic selection factors to operate. The procedure
is objective, divorced from our own predilections and biases.
The reader may be experiencing a vague and disquieting sense of uneasiness. If we
can't be sure that random samples are representative, how can we have confidence in our
research results and their applicability to the populations from which we draw our sam-
ples? Why not select samples systematically so that they are representative? The answer is
complex. First —
and again —
we cannot ever be sure. Second, random samples are more
likely to include the characteristics typical of the population if the characteristics are
frequent in the population. In actual research, we draw random samples whenever we can
and hope and assume that the samples are representative. We learn to live with uncer-
tainty, but try to cut it down whenever we can — just as we do in ordinary day-to-day
"D. Stilson. Probability and Statistics in Psychological Research and Theory. San Francisco: Holden-Day,
1966. p. 35.
RANDOMNESS
The notion of randomness is at the core of modern probabilistic methods in the natural and
behavioral sciences. But it is difficult to define "random." The dictionary notion of
haphazard, accidental, without aim or direction, does not help us much. In fact, scientists
are quite systematic about randomness; they carefully select random samples and plan
random procedures.
The position can be taken that nothing happens at random, that for any event there is
a cause. The only reason, this position might say, that one uses the word random is that
human beings do not know enough. To omniscience nothing is random. Suppose an
omniscient being has an omniscient newspaper. It is a gigantic newspaper in which every
³See J. Kemeny, A Philosopher Looks at Science. New York: Van Nostrand Reinhold, 1959, p. 39.
⁴Ibid., pp. 68-75.
⁵The source of random numbers used was: Rand Corporation, A Million Random Digits with 100,000
Normal Deviates. New York: Free Press, 1955. This is a large and carefully constructed table of random
numbers. There are many other such tables, however, that are good enough for most practical purposes. Modern
statistics texts have such tables. Appendix C at the end of this book contains 4,000 computer-generated random
numbers.
the whole population of random numbers. And the number of even numbers in each
sample of 10 should be approximately equal to the number of odd numbers — though,
again, there will be fluctuations, some of them perhaps extreme but most of them comparatively
modest. The samples are given in Table 8.1.

[Table 8.1: ten samples of 10 random numbers each, drawn from a random numbers table]
RANDOMIZATION
Suppose an investigator wishes to test the hypothesis that counseling helps underachiev-
ers. He wants to set up two groups of underachievers, one to be counseled, one not to be
counseled. Naturally, he also wishes to have the two groups equal in other independent
variables that may have a possible effect on achievement. One way he can do this is to
assign the children to both groups at random by, say, tossing a coin for each child in turn
and assigning the child to one group if the toss is heads and to the other group if the toss
is tails. (Note that if he had three experimental groups he would probably not use coin-
tossing. He might use a die.) Or he can use a table of random numbers and assign the
children as follows: if an odd number turns up, assign a child to one group, and if an even
number turns up, assign the child to the other group. He can now assume that the groups
are approximately equal in all possible independent variables. The larger the groups, the
safer the assumption. Just as there is no guarantee, however, of not drawing a deviant
sample, as discussed earlier, there is no guarantee that the groups are equal or even
approximately equal in all possible independent variables. Nevertheless, it can be said that
the investigator has used randomization to equalize his groups, or, as it is said, to con-
trol influences on the dependent variable other than that of the manipulated independent
variable.
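The odd-even assignment procedure just described takes only a few lines of code. In the sketch below, a hypothetical Python illustration, each child is assigned by the parity of a random number; note that the two groups come out only roughly equal in size.

# Random assignment of subjects to two groups, in the spirit of the
# coin-tossing procedure described above. The names are hypothetical.
import random

children = [f"child_{i}" for i in range(1, 21)]
counseled, not_counseled = [], []

for child in children:
    if random.randint(1, 100) % 2 == 1:     # odd number: one group
        counseled.append(child)
    else:                                   # even number: the other
        not_counseled.append(child)

print(len(counseled), len(not_counseled))   # roughly equal, not guaranteed

If exactly equal group sizes are wanted, a common variant is to shuffle the list of subjects at random and split it in half.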
An "ideal" experiment is one in which all the factors or variables likely to affect the
experimental outcome are controlled. If we knew all these factors, in the first place, and
could control them, in the second place, then we might have an ideal experiment. But the
sad case is that we can never know all the pertinent variables nor could we control them
even if we did know them. Randomization, however, comes to our aid.
Randomization is the assignment to experimental treatments of members of a universe
in such a way that, for any given assignment to a treatment, every member of the universe
has an equal probability of being chosen for that assignment. The basic purpose of random
assignment, as indicated earlier, is to apportion subjects (objects, groups) to treatments so
that individuals with varying characteristics are spread approximately equally among the
treatments so that variables that might affect the dependent variable other than the experi-
mental variables have "equal" effects in the different treatments.⁷ There is no guarantee
that this desirable state of affairs will be attained, but it is more likely to be attained with
randomization than otherwise. The idea of randomization seems to have been discovered
or invented by Sir Ronald Fisher, who virtually revolutionized statistical and experimental
design thinking and methods using random notions as part of his leverage.⁸ In any case,
randomization and what can be called the principle of randomization is one of the great
intellectual achievements of our time. It is not possible to overrate the importance of both
the idea and the practical measures that come from it to improve experimentation and
inference.
⁷Randomization also has a statistical rationale and purpose. If random assignment has been used, then it
is possible to distinguish between systematic or experimental variance and error variance. Biasing variables —
Hays calls them "nuisance" variables — are distributed to experimental groups according to chance. As Wendel,
in a letter to Science (1978, 199, p. 368), says, "the biasing errors become random errors." Wendel also says
that the "equalization" function is secondary to the statistical function. Strictly speaking, the tests of statistical
significance that we will discuss later logically depend on random assignment. Without it the significance tests
lack logical foundation. (See following footnote.)
"See R. A. Fisher, The Desii^n of Experiments. New York: Hafner. 1951. Chap. U. This chapter begins
with Fisher's famous lady who said that by tasting a cup of tea she could tell whether the milk or the tea was first
added to the cup. He uses the example to illustrate the necessity and importance of randomization. The chapter is
a fine statement of the physical and statistical conditions of experiments.
Randomization can perhaps be clarified in two or three ways: by stating the principle
of randomization, by describing how one uses it in practice, and by demonstrating how it
works with objects and numbers. The importance of the idea deserves all three.
The principle of randomization may be stated thus: Since, in random procedures,
every member of a population has an equal chance of being selected, members with
certain distinguishing characteristics —
male or female, high or low intelligence, conserv-
ative or liberal, and so on and on — will, if selected, probably be offset in the long run by
the selection of other members of the population with counterbalancing quantities or
qualities of the characteristics. We can say that this is a practical principle of what usually
happens; we cannot say that it is a law of nature. It is simply a statement of what most
often happens when random procedures are used.
We say that subjects are assigned at random to experimental groups, and that experi-
mental treatments are assigned at random to groups. For instance, in the example cited
above of an experiment to test the effectiveness of counseling on achievement, subjects
can be assigned to two groups at random by using random numbers or by tossing a coin.
When the subjects have been so assigned, the groups can be randomly designated as
experimental and control groups using a similar procedure. We will encounter a number of
examples of randomization as we go along.
To show how, if not why, the principle of randomization works, we now set up a sampling
and design experiment. We have a population of 100 members of the United States Senate
from which we can sample. In this population (in 1981), there are 53 Republicans and 47
Democrats. I have selected two important votes, one (Issue 162) on aid programs for
welfare children and the other (Issue 121) on proposed reductions in Social Security
benefits.⁹ While these votes were important, since each of them reflected presidential
proposals, a Nay vote on 162 and a Yea vote on 121 indicating support of the President,
we here ignore their substance and treat the actual votes, or rather, the senators who cast
the votes, as populations from which we sample.
We pretend we are going to do an experiment using three groups of senators, with 20
in each group. The nature of the experiment is not too relevant here, but let us say that we
want to test the efficacy of a film depicting the horrors of nuclear warfare in changing the
attitudes of the senators toward nuclear test bans. We want the three groups of senators to
be approximately equal in all possible characteristics. Using a programmable calculator-
computer I generated random numbers between 1 and 100.¹⁰ The first 60 numbers drawn,
with no repeated numbers (sampling without replacement), were recorded in groups of 20
each. Political party affiliation, 1 = Republican, 0 = Democrat, and the senators' votes
on the two issues, 1 = Yea and 0 = Nay, were also recorded.
How '"equal" are the groups? In the total population of 100 senators, 53 are Republi-
cans and 47 Democrats, or 53 percent and 47 percent. In the total sample of 60 there are
30 Republicans and 30 Democrats, or 50 percent each, a difference of 3 percent from the
expectation of 53 percent and 47 percent. The obtained and expected frequencies of
Republicans in the three groups and the total sample are given in Table 8.2. The devia-
⁹Congressional Quarterly, 1981 (39), pp. 920 (No. 121) and 1156 (No. 162).
¹⁰Hewlett-Packard HP-67. Hewlett-Packard HP-67/HP-97: Stat Pac I, pp. 04-01-04-05. This program is
based on a method described in: D. E. Knuth, The Art of Computer Programming, vol. 2. Reading, Mass.:
Addison-Wesley, 1971. In this chapter we have used three different methods of generating pseudo-random
numbers (as they are more properly called): drawing them from a random numbers table, generating them with a
programmable hand-held calculator, and generating them on a large computer.
tions from expectation are obviously small. The three groups are "equal" in the sense that
they have equal numbers of Republican senators — and, of course, Democrats.¹¹

[Table 8.2: obtained and expected frequencies of Republicans in the three groups and the total sample]
two issues is approximately the same in each of the groups. The deviations from chance
expectation of the Yea votes (and, of course, the Nay votes) are small. So far as we can
see, then, the randomization has been "successful."¹² We can now do our experiment
believing that the three groups are "equal." They may not be, of course, but the probabilities
are in our favor. And as we have seen, the procedure usually works well.¹³ Our
checking of the characteristics of the senators in the three groups showed that the groups
were fairly "equal" in political preference and Yea (and Nay) votes on the two issues.
Thus we can have greater confidence that if the groups become unequal in attitudes toward
a nuclear test ban, the differences are probably due to our experimental manipulation and
not to differences between the groups before we started.
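The demonstration is easy to repeat by simulation. The sketch below, in Python, draws 60 of 100 "senators" without replacement and splits them into three groups of 20; the party labels here are simulated from the 53-47 split, not the actual 1981 roster.

# Drawing three random groups of 20 from a population of 100 and checking
# how "equal" the groups are on a 53 Republican / 47 Democrat split.
import random

population = list(range(1, 101))
party = {n: ("R" if n <= 53 else "D") for n in population}   # simulated labels

drawn = random.sample(population, 60)       # sampling without replacement
groups = [drawn[0:20], drawn[20:40], drawn[40:60]]

for i, g in enumerate(groups, start=1):
    republicans = sum(1 for n in g if party[n] == "R")
    print(f"group {i}: {republicans} Republicans, {20 - republicans} Democrats")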
SAMPLE SIZE
A rough-and-ready rule taught to beginning students of research is: Use as large samples
as possible. Whenever a mean, a percentage, or other statistic is calculated from a sample,
a population value is being estimated. A question that must be asked is: How much error is
likely to be in statistics calculated from samples of different sizes? The curve of Figure 8.1
roughly expresses the relations between sample size and error, error meaning deviation
from population values. The curve says that the smaller the sample the larger the error,
and the larger the sample the smaller the error.
Consider the following rather extreme example. Total reading and total mathematics
scores of 327 Eugene. Oregon, sixth-grade children on the Metropolitan Achievement
[Figure 8.1: curve relating size of sample (horizontal axis, small to large) to error (vertical axis, small to large); error is large for small samples and small for large samples]
'"This demonstration can also be interpreted as a random sampling problem. We may ask. for e.xample.
whether the three samples of 20 each and the total sample of 60 are representative Do they accurately reflect the
,
characteristics of the population of 100 senators? For instance, do the samples retlect the proportions of Republi-
cans and Democrats in the Senate? The proportions in the samples were all .50 and .50. The actual proportions
are .53 and .47. Although there are 3 percent deviations in the samples, the deviations are within chance
expectation. We can say. therefore, that the samples are representative insofar as political party membership is
concerned. Similar reasoning applies to the samples and the votes on the two issues.
¹³No less an expert than Feller, however, writes: "In sampling human populations the statistician encounters
considerable and often unpredictable difficulties, and bitter experience has shown that it is difficult to obtain
even a crude image of randomness." Feller, op. cit., p. 29.
Tests (administered in 1978), together with the sex of the pupils, were made available to
me.¹⁴ From this "population," 10 samples of two pupils each were randomly selected.¹⁵
These sample scores and the sample means are given in Table 8.4. The deviations of the
means from the means of the population are also given in the table.
TABLE 8.4  Samples (n = 2) of Reading and Mathematics Scores of 327 Sixth-Grade Children,
Means of the Samples, and Deviations of the Sample Means from the Population
Mean
TABLE 8.5  Means and Deviations from Population Means of Four Reading and Four Mathematics
Samples, n = 20, Total Sample, n = 80, and Population, N = 327, Eugene
Data
are representative, "typical," and suitable for certain research purposes. Quota sampling
derives its name from the practice of assigning quotas, or proportions of kinds of people,
to interviewers. Such sampling has been used a good deal in public opinion polls. Another
form of nonprobability sampling is purposive sampling, which is characterized by the use
of judgment and a deliberate effort to obtain representative samples by including presuma-
bly typical areas or groups in the sample. So-called "accidental" sampling, the weakest
form of sampling, is probably also the most frequent. In effect, one takes available
samples at hand: classes of seniors in high school, sophomores in college, a convenient
PTA, and the like. This practice is hard to defend. Yet, used with reasonable knowledge
and care, it is probably not as bad as it has been said to be. The most sensible advice
seems to be: Avoid accidental samples unless you can get no others (random samples are
usually expensive and, in general, hard to come by) and, if you do use them, use extreme
circumspection in analysis and interpretation of data.
Probability sampling includes a variety of forms. The most general of these are strati-
fied sampling and cluster sampling. In stratified sampling, the population is divided into
strata, such as men and women, black and white, and the like, from which random
samples are drawn. Cluster sampling, the most used method in surveys, is the successive
random sampling of units, or sets and subsets. In educational research, for example,
school districts of a state or county can be randomly sampled, then schools, then classes,
and finally pupils. Another kind of probability sampling — if, indeed, it can be called probability sampling — is systematic sampling. Here the first sample element is randomly chosen from numbers 1 through k and subsequent elements are chosen at every kth interval. For example, if the element randomly selected from the elements 1 through 10 is 6, then the subsequent elements are 16, 26, 36, and so on.
The student who will pursue research further should, of course, know much more about these methods and should consult one or more of the excellent references on the subject.¹⁷
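Systematic sampling, at least, is simple enough to state as a procedure. A minimal Python sketch (mine, not from the text) of the rule just described:

    # Systematic sampling: random start in 1..k, then every kth element.
    import random

    def systematic_sample(N, k):
        start = random.randint(1, k)          # first element chosen at random
        return list(range(start, N + 1, k))   # every kth element thereafter

    random.seed(3)
    print(systematic_sample(100, 10))
    # If the random start were 6, this would print [6, 16, 26, 36, ...],
    # matching the example in the text.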
Randomness, randomization, and random sampling are among the great ideas of sci-
ence, as indicated earlier. While research can, of course, be done without using ideas of
randomness, it is difficult to conceive how it can have viability and validity, at least in
most aspects of behavioral scientific research. Modern notions of research design, sam-
pling, and inference, for example, are literally inconceivable without the idea of random-
ness. One of the most remarkable of paradoxes is that through randomness, or "dis-
order," we are able to achieve control over the often obstreperous complexities of
psychological, sociological, and educational phenomena. We impose order, in short, by
exploiting the known behavior of sets of random events. One is perpetually awed by what
can be called the structural beauty of probability, sampling, and design theory and by its
great usefulness in solving difficult problems of research design and planning and the
analysis and interpretation of data.
Before leaving the subject, let's return to a view of randomness mentioned earlier. To an omniscient being, there is no randomness. By definition such a being would "know" the occurrence of any event with complete certainty.¹⁸ As Poincaré points out, to gamble with such a being would be a losing venture. Indeed, it would not be gambling. When a
"A clear exposition of different kinds of sampling can be found in; F. Stephan and P. McCarthy, Sampling
Opinions. New York: Wiley, 1%3 (1958), chap. 3. An excellent account of general principles of sampling,
together with examples and formulas for estimates, G. Snedecor and W. Cochran. Siurisiical Methods. 5th
is:
ed. Ames, Iowa; Iowa Although oriented toward biology and agricul-
State University Press, 1967. chap. 17.
ture, the principles and methods of this authoritative book are easily applied to behavioral
disciplines. Another authoritative reference is; L. Kish. "Selection of the Sample." In L. Festinger and
D. Katz, eds.. Research Methods in the Behavioral Sciences. New York; Holt. Rinehart and Winston, 1953.
pp. 175-239. On sampling and estimation, see D. Warwick and C. Lininger, The Sample Survey: Theory
and Practice. New York: McGraw-Hill, 1975. chap. 4.
¹⁸For an eloquent discussion of this point, see Poincaré's essay on chance: H. Poincaré, Science and Method. New York: Dover, 1952, pp. 64-90.
coin was tossed ten times, he (or she) would predict heads and tails with complete certainty and accuracy. When dice were thrown, he would know infallibly what the outcomes would be. He would even be able to predict every number in a table of random numbers! And certainly he would have no need for research and science. What I seem to be saying is that randomness is a term for ignorance. If we, like the omniscient being, knew all the contributing causes of events, then there would be no randomness. The beauty of it, as indicated above, is that we use this "ignorance" and turn it to knowledge. How we do this should become more and more apparent as we go on with our study.
Study Suggestions
A variety of experiments with chance phenomena is recommended: games using coins, dice, cards,
roulette wheels, and tables of random numbers. Such games, properly approached, can help one
learn a great deal about fundamental notions of modern scientific research, statistics, probability,
and, of course, randomness. Try the problems given in the suggestions below. Do not become
discouraged by the seeming laboriousness of such exercises here and later on in the book. It is
evidently necessary and, indeed, helpful occasionally to go through the routine involved in certain
problems. After working the problems given, devise some of your own. If you can devise intelligent
problems, you are probably well on your way to understanding.
1. From a table of random numbers draw 50 numbers, 0 through 9. (Use the random numbers of Appendix C, if you wish.) List them in columns of 10 each.
(a) Count the total number of odd numbers; count the total number of even numbers. What would you expect to get by chance? Compare the obtained totals with the expected totals.
(b) Count the total number of numbers 0, 1, 2, 3, 4. Similarly count 5, 6, 7, 8, 9. How many of the first group should you get? The second group? Compare what you do get with what you should get by chance.
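Students with a computer at hand can do suggestion 1 in a few lines of Python (an illustrative sketch; a table of random numbers works just as well):

    # Draw 50 random digits and compare counts with chance expectation.
    import random

    random.seed(4)
    digits = [random.randint(0, 9) for _ in range(50)]

    odd = sum(1 for d in digits if d % 2 == 1)
    low = sum(1 for d in digits if d <= 4)    # digits 0 through 4
    print(f"odd: {odd}, even: {50 - odd} (expected: 25 each)")
    print(f"0-4: {low}, 5-9: {50 - low} (expected: 25 each)")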
2. This is a class exercise and demonstration. Assign numbers arbitrarily to all the members of the class from 1 through N, N being the total number of members of the class. Take a table of random numbers and start with any page. Have a student wave a pencil in the air and blindly stab at the page of the table. Starting with the number the pencil indicates, choose n two-digit numbers between 1 and N (ignoring numbers greater than N and repeated numbers) by, say, going down columns (or any other specified way). n is the numerator of the fraction n/N, which is decided by the size of the class. If N = 30, for instance, let n = 10. Repeat the process twice on different pages of the random numbers table. You now have three equal groups (if N is not divisible by 3, drop one or two persons at random). Write the random numbers on the blackboard in the three groups. Have each class member call out his height in inches. Write these values on the blackboard separate from the numbers, but in the same three groups. Add the three sets of numbers in each of the sets on the blackboard, the random numbers and the heights. Calculate the means of the six sets of numbers. Also calculate the means of the total sets.
(a) How close are the means in each of the sets of numbers? How close are the means of the
groups to the mean of the total group?
(b) Count the number of men and women in each of the groups. Are the sexes spread fairly
evenly among the three groups?
(c) Discuss this demonstration. What do you think is its meaning for research?
122 • Probability, Randomness, and Sampling
3. In Chapter 6, it was suggested that the student generate 20 sets of 100 random numbers between 0 and 100 and calculate means and variances. If you did this, use the numbers and statistics in this exercise. If you did not, use the numbers and statistics of Appendix C at the end of the book.
(a) How close to the population mean are the means of the 20 samples? Are any of the
means "deviant"? (You might judge this by calculating the standard deviation of the
means and adding and subtracting two standard deviations to the total mean.)
(b) On the basis of (a), above, and your judgment, are the samples "representative"? What
does "representative" mean?
(c) Pick out the third, fifth, and ninth group means. Suppose that 300 subjects had been assigned at random to the three groups and that these were scores on some measure of importance to a study you wanted to do. What can you conclude from the three means, do you think?
4. Most published studies in the behavioral sciences and education have not used random
samples, especially random samples of large populations. Occasionally, however, studies based on
random samples are done. One such study is: [citation not recovered]
This study is worth careful reading, even though its level of methodological sophistication puts a number of its details beyond our present grasp. Try not to be discouraged by this sophistication. Get
what you can out of it, especially its sampling of a large population of young men. Later in the book
we will return to the interesting problem pursued. At that time, perhaps the methodology will no
longer appear so formidable. (In studying research, it is sometimes helpful to read beyond our
present capacity — provided we don't do too much of it!)
5. Random assignment of subjects to experimental groups is much more common than random
sampling of subjects. A particularly good, even excellent, example of research in which subjects
were assigned at random to two experimental groups, is: [citation not recovered]
Again, don't be daunted by the methodological details of this study. Get what you can out of it.
Note at this time how the subjects were classified into aptitude groups and then assigned at random
to experimental treatments. We will also return to this study later. At that time, we should be able to
understand its purpose and design and be intrigued by its carefully controlled experimental pursuit
of a difficult substantive educational problem: the comparative merits of so-called individualized
mastery instruction and conventional lecture-discussion-recitation instruction.
Special Note. In some of the above study suggestions and in those of Chapter 6, instructions were given to draw numbers from tables of random numbers or to generate sets of random numbers using a computer. If you have a microcomputer or have access to one, you may well prefer to generate the random numbers using the built-in random number generator (function) of the microcomputer. Study the computer manual to find out how to produce such numbers. It should be simple. On the widely available TRS-80, Apple, and IBM machines, for example, random numbers can be produced quite easily with a few instructions in BASIC, the language common to most microcomputers. How "good" are the random numbers generated? ("How good?" means "How random?") Since they are produced in line with the best contemporary theory and practice, they should be satisfactory, although they might not meet the exacting requirements of some experts. In my experience, they are quite satisfactory, and I recommend their use to teachers and students. (See, also, footnotes 10 and 15.)
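(In a present-day rendering of this note, the BASIC of the TRS-80 era gives way to, for example, Python, whose standard library generator plays the role of BASIC's RND function. A minimal sketch:

    # Pseudorandom numbers from Python's built-in generator.
    import random

    print(random.random())         # a uniform number in [0.0, 1.0)
    print(random.randint(0, 100))  # a random integer from 0 to 100 inclusive
    print([random.randint(0, 9) for _ in range(10)])  # ten random digits

Such generators are produced in line with contemporary theory and practice and should be satisfactory for the exercises of this book.)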
PART FOUR
Analysis, Interpretation, Statistics, and Inference
Chapter 9
Principles of Analysis
and Interpretation
The research analyst breaks down data into constituent parts to obtain answers to re-
search questions and to test research hypotheses. The analysis of research data, however,
does not in and of itself provide the answers to research questions. Interpretation of the
data is necessary. To interpret is to explain, to find meaning. It is difficult or impossible to
explain raw data: one must first analyze the data and then interpret the results of the analysis.¹
Analysis means the categorizing, ordering, manipulating, and summarizing of data to
obtain answers to research questions. The purpose of analysis is to reduce data to intelligi-
ble and interpretable form so that the relations of research problems can be studied and
tested. A primary purpose of statistics, for example, is to manipulate and summarize
numerical data and to compare the obtained results with chance expectations. A researcher
hypothesizes that styles of leadership affect group-member participation in certain ways.
He plans an experiment, executes the plan, and gathers data from his subjects. Then he
must so order, break down, and manipulate the data that he can answer the question: How
do styles of leadership affect group-member participation? It should be apparent that this
view of analysis means that the categorizing, ordering, and summarizing of data should be
¹"Data," as used in behavioral research, means research results from which inferences are drawn: usually numerical results, like scores of tests and statistics such as means, percentages, and correlation coefficients. The word is also used to stand for the results of mathematical and statistical analysis; we will soon study such analysis and its results. "Data" can be more, however: newspaper and magazine articles, biographical materials, diaries, and so on — indeed, verbal materials in general. In other words, "data" is a general term with several meanings. Think of research data, too, as the results of systematic observation and analysis used to make inferences and arrive at conclusions. Scientists make observations, assign symbols and numbers to the observations, manipulate the symbols and numbers to put them into interpretable form, and then, from these "data," make inferences about the relations among the variables of research problems. ("Data" is usually a plural noun, and we will so use it in this book. The singular is the seldom-used "datum.")
planned early in the research. The researcher should lay out analysis paradigms or models
even when working on the problem and hypotheses. Only in this way can he see, even if only dimly, whether his data and their analysis can and will answer the research questions.
Interpretation takes the results of analysis, makes inferences pertinent to the research
relations studied, and draws conclusions about these relations. The researcher who inter-
prets research results searches them for their meaning and implications. This is done in
two ways. One, the relations within the research study and its data are interpreted. This is
the narrower and more frequent use of the term interpretation. Here interpretation and
analysis are closely intertwined. One almost automatically interprets as one analyzes.
That is, when one calculates, say, a coefficient of correlation, one almost immediately
infers the existence of a relation.
Two, the broader meaning of research data is sought. One compares the results and inferences drawn from the data to theory and to other research results. One seeks the meaning and implications of research results within the study results, and their congruence or lack of congruence with the results of other researchers. More important, one compares results with the demands and expectations of theory.
An example that may illustrate these ideas is research on perception of teacher characteristics.² On the basis of so-called directive-state and social perception theory,³ it was predicted that perceptions or judgments of desirable characteristics of effective teachers will in part be determined by the attitudes toward education of the individuals making the judgments. Suppose, now, that we have measures of attitudes toward education and measures of the perceptions or judgments of the characteristics of effective teachers. We correlate the two sets of measures: the correlation is substantial. This is the analysis. The data have been broken down into the two sets of measures, which are then compared by means of a statistical procedure.
The result of the analysis, a correlation coefficient, now has to be interpreted. What is
its meaning? Specifically, what is its meaning within the study? What is its broader
meaning in the light of previous related research findings and interpretations? And what is
its meaning as confirmation or lack of confirmation of theoretical prediction? If the "internal" prediction holds up, one then relates the finding to other research findings which may or may not be consistent with the present finding.
The correlation was substantial. Within the study, then, the correlation datum is consistent with theoretical expectation. Directive-state theory says that central states influence perceptions. Attitude is a central state; it should therefore influence perception. The specific inference is that attitudes toward education influence perceptions of the effective teacher. We measure both variables and correlate the measures. From the correlation coefficient we make an inferential leap to the hypothesis: since it is substantial, as predicted, the hypothesis is supported. We then attempt to relate the finding to other research and other theory.
²F. Kerlinger and E. Pedhazur, "Educational Attitudes and Perceptions of Desirable Traits of Teachers," American Educational Research Journal, 5 (1968), 543-560.
³Directive-state theory is a broad theory of perception that says in effect that our perceptions of cognitive objects are colored by our emotions, needs, wants, motives, attitudes, and values. These latter are, so to speak, directive states within the individual influencing his perceptions and judgments. F. Allport, Theories of Perception and the Concept of Structure. New York: Wiley, 1955, chaps. 13, 14, and 15.
continuous and categorical variables in Chapter 3.) Although both kinds of variables and
measures can be subsumed under the same measurement frame of reference, in practice it
is necessary to distinguish them.
Frequencies are the numbers of objects in sets and subsets. Let U be the universal set with N objects. Then N is the number of objects in U. Let U be partitioned into A₁, A₂, . . . , Aₖ. Let n₁, n₂, . . . , nₖ be the numbers of objects in A₁, A₂, . . . , Aₖ. Then n₁, n₂, . . . , nₖ are called frequencies.
It is helpful to look at this as a function. Let X be any set of objects with members {x₁, x₂, . . . , xₙ}. We wish to measure an attribute of the members of the set; call it M. Let Y = {0, 1}. Let the measurement be described as a function:

f = {(x, y): x is an object, and y = 0 or 1}
With continuous measures, the basic idea is the same. Only the rule of correspondence, f, and the numerals assigned to objects change. The rule of correspondence is more elaborate and the numerals are generally 0, 1, 2, . . . and fractions of these numerals. In other words, we write a measurement equation:

f = {(x, y): x is an object, and y = any numeral}

which is the generalized form of the function.⁴ This digression is important, because it
helps us to see the basic similarity of frequency analysis and continuous measure analysis.
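The point can be made concrete with a small sketch (mine, with invented objects) in Python:

    # The measurement function f assigns each object x a value y from
    # Y = {0, 1}, standing for possession of the attribute M; a frequency
    # is then the sum of the assigned 1's.
    X = ["x1", "x2", "x3", "x4", "x5"]                 # a set of objects
    M = {"x1": 1, "x2": 0, "x3": 1, "x4": 1, "x5": 0}  # rule of correspondence

    def f(x):
        return M[x]                     # y = f(x), y in {0, 1}

    frequency = sum(f(x) for x in X)    # number of objects possessing M
    print(frequency)                    # prints 3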
RULES OF CATEGORIZATION
The first step in any analysis is categorization. It was said earlier (Chapter 4) that parti-
tioning is the foundation of analysis. We will now see why. Categorization is merely
another word for partitioning — that is, a category is a partition or a subpartition. If a set
of objects is categorized in some way, it is partitioned according to some rule. The rule
tells us, in effect, how to assign set objects to partitions and subpartitions. If this is so,
then the rules of partitioning we studied earlier apply to problems of categorization. We
need only explain the rules, relate them to the basic purposes of analysis, and put them to
work in practical analytic situations.
Five rules of categorization are given below. Two of them, (2) and (3), are the exhaus-
tiveness and disjointness rules discussed in Chapter 4. Two others, (4) and (5), can
actually be deduced from the fundamental rules, (2) and (3). Nevertheless, we list them as
separate rules for practical reasons.
Rule 1 is the most important. If categorizations are not set up according to the demands of the research problem [remainder of this passage, which introduces the Rosenberg and Simmons study of self-esteem, not recovered].
⁴This equation and the ideas behind it will be explained in detail in Chapter 25.
                 Black    White
    High
    Medium          (frequencies)
    Low
Since Rosenberg and Simmons had continuous measures of self-esteem, they might have
used this paradigm:
                 Black    White
        (self-esteem measures)
It is obvious that both paradigms bear directly on the problem: it is possible in both to test the relation between race and self-esteem, albeit in different ways. The authors chose the first method and found, contrary to common expectation, that black children's self-esteem was higher, not lower, than that of white children.⁵ The second paradigm would undoubtedly have led to the same conclusion. The point is that an analytical paradigm is, in effect, another way to state a problem, a hypothesis, a relation. That one paradigm uses frequencies while the other uses continuous measures in no way alters the relation tested. In other words, both modes of analysis are logically similar: they both test the same proposition. They differ in the data they use, in statistical tests, and in sensitivity and power.
There are several things a researcher might do that would be irrelevant to the problem. If he included one, two, or three variables in the study with no theoretical or practical reason for doing so, then the analytic paradigm would be at least partly irrelevant to the problem, which concerns differences between the two types of schools and, of course, between religious instruction and no religious instruction. He might bring other variables into the picture that have little or no bearing on the problem, for example, differences in teacher experience and training or teacher-pupil ratios. If, on the other hand, he thought that certain variables, like sex, family religious background, and perhaps personality variables, might interact with reli-
⁵M. Rosenberg and R. Simmons, Black and White Self-Esteem: The Urban Child. Washington, D.C.: American Sociological Association, 1971.
gious instruction to produce differences, then he might be justified in building such varia-
bles into the research problem and consequently into the analytic paradigm.⁶
Rule 2, on exhaustiveness, means that all subjects, all objects of U, must be used up. All individuals in the universe must be capable of being assigned to the cells of the analytic paradigm. With the example just considered, each child either goes to parochial school or to public school. If, somehow, the sampling had included children who attend private schools, then the rule would be violated because there would be a number of children who could not be fitted into the implied paradigm of the problem. (What would a frequency analysis paradigm look like? Conceive the dependent variable as honesty.) If, however, the research problem called for private school pupils, then the paradigm would have to be changed by adding the rubric Private to the rubrics Parochial and Public.
The exhaustiveness criterion is not always easy to satisfy. With some categorical
variables, there is no problem. If sex is one of the variables, any individual has to be male
or female. Suppose, however, that a variable under study were religious preference and
we set up, in a paradigm, Protestant-Catholic-Jew. Now suppose some subjects were
atheists or Buddhists. Clearly the categorization scheme violates the exhaustiveness rule:
some subjects would have no cells to which to be assigned. Depending on numbers of
cases and the research problem, we might add another rubric, Others, to which we assign
subjects who are not Protestants, Catholics, or Jews. Another solution, especially when
the number of Others is small, is to drop these subjects from the study. Still another
solution is to put these other subjects, if it is possible to do so, under an already existing
rubric. Other variables where this problem is encountered are political preference, social
class, types of education, and so on.
Rule 3 is one that often causes research workers concern. To demand that the catego-
ries be mutually exclusive means, as we learned earlier, that each object of U, each
research subject (actually the measure assigned to each subject), must be assigned to one
cell and one cell only of an analytic paradigm. This is a function of operational definition.
Definitions of variables must be clear and unambiguous so that it is unlikely that any
subject can be assigned to more than one cell. If religious preference is the variable being
defined, then the definition of membership in the subsets Protestant, Catholic, and Jew
must be clear and unambiguous. It may be "registered membership in a church." It may be "born in the church." It may simply be the subject's identification of himself as a
Protestant, a Catholic, or a Jew. Whatever the definition, it must enable the investigator to
assign any subject to one and only one of the three cells.
The independence part of Rule 3 is often difficult to satisfy, especially with continu-
ous measures — and sometimes with frequencies. Independence means that the assign-
ment of one object to a cell in no way affects the assignment of any other object to that cell
or to any other cell. Random assignment from an infinite or very large universe, of course,
satisfies the rule. Without random assignment, however, we run into problems. When
assigning objects to cells on the basis of the object's possession of certain characteristics,
the assignment of an object now may affect the assignment of another object later.
Rule 4, that each category (variable) be derived from one classificatory principle, is
sometimes violated by the neophyte. If one has a firm grasp of partitioning, this error is
easily avoided. The rule means that, in setting up an analytic design, each variable has to
be treated separately, because each variable is a separate dimension. One does not put two
⁶In the next chapter, elementary consideration will be given to frequency analysis with more than one independent variable. In later chapters there will be much more detailed consideration of both frequency and continuous measure analysis with several independent variables. The reader should not now be concerned with complete understanding of examples like those given above. They will be clarified later.
    Admitted
    Not Admitted
    [remaining rubrics of this faulty paradigm not recovered]

It is clear that this paradigm violates the rule: it has one category derived from two variables. Each variable must have its own category. A correct paradigm might look like this: [paradigm not recovered]
The discussions of statistics in this book have as their purpose basic understanding of statistics and statistical inference and the relation of statistics and statistical inference to research. In this section, the major forms of statistical analysis are discussed briefly to give the reader an overview of the subject; they are discussed, however, only in relation to research. It is assumed that the reader has already studied the simpler descriptive statistics. Those who have not can find good discussions in elementary textbooks.⁹
Frequency Distributions
Although frequency distributions are used primarily for descriptive purposes, they can be used for other research purposes. For example, one can test whether two or more distributions are sufficiently similar to warrant merging them. Suppose one were studying the verbal learning of boys and girls in the sixth grade. After obtaining large numbers of verbal learning scores, one can compare and test the differences between the boy and girl distributions.¹⁰ If the test shows the distributions to be the same — and other criteria are satisfied — they can perhaps be combined for other analysis.
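One way to carry out such a test today is sketched below in Python, with simulated scores; the two-sample Kolmogorov-Smirnov test used here is one reasonable choice among several, and the text does not prescribe a particular test:

    # Test whether two score distributions differ before merging them.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(7)
    boys = rng.normal(50, 10, 200)    # hypothetical verbal learning scores
    girls = rng.normal(50, 10, 200)

    stat, p = ks_2samp(boys, girls)
    print(f"KS statistic = {stat:.3f}, p = {p:.3f}")
    # A large p gives no reason to believe the distributions differ,
    # one criterion for combining the two samples.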
Observed distributions can also be compared to theoretical distributions. The best-
known such comparison is with the so-called normal distribution. It may be important to
know that obtained distributions are normal in form, or, if not normal, depart from
normality in certain specifiable ways. Such analysis can be useful in other theoretical and
applied work and research. In theoretical study of abilities it is important to know whether
such abilities are in fact distributed normally. Since a number of human characteristics
have been found to be normally distributed,¹¹ researchers can ask significant questions about "new" characteristics being investigated.
Applied educational research can profit from careful study of distributions of intelli-
gence, aptitude, and achievement scores. Is it conceivable that an innovative learning
program can change the distributions of the achievement scores, say, of third and fourth
graders? Can it be that massive early education programs can change the shape of distribu-
tions, as well as the general levels of scores?
Allport"s study of social conformity many years ago showed that even a complex
behavioral phenomenon like conformity can be profitably studied using distribution
analysis.'" Allport was able to show that a number of social behaviors stopping for red —
lights, parking violations, religious observances, and so on — were distributed in the form
of a J curve, with most people conforming, but with predictable smaller numbers not
conforming in different degrees.
Distributions have probably been too little used in the behavioral sciences and education. The study of relations and the testing of hypotheses are almost automatically associated with correlations and comparisons of averages. The use of distributions is considered less often. Some research problems, however, can be solved better by using distribution analysis. Studies of pathology and other unusual conditions are perhaps best approached through a combination of distribution analysis and probabilistic notions.
⁹For example, A. Edwards, Statistical Analysis, 3d ed. New York: Holt, Rinehart and Winston, 1969; D. Freedman, R. Pisani, and R. Purves, Statistics. New York: Norton, 1978.
¹⁰See W. Hays, Statistics, 3d ed. New York: Holt, Rinehart and Winston, 1981, pp. 576ff.
¹¹A. Anastasi, Individual Differences, 3d ed. New York: Macmillan, 1958, pp. 26ff. The student of research in education, psychology, and sociology should study Anastasi's outstanding contribution to our understanding of individual differences. Her book also contains many examples of distributions of empirical data.
¹²F. Allport, "The J-Curve Hypothesis of Conforming Behavior." In T. Newcomb and E. Hartley, eds., Readings in Social Psychology. New York: Holt, Rinehart and Winston, 1947, pp. 55-67.
Graphs and Graphing
One of the most powerful tools of analysis is the graph. A graph is a two-dimensional
representation of a relation or relations. It exhibits pictorially sets of ordered pairs in a
way no other method can. If a relation exists in a set of data, a graph will not only clearly
show it; it will show its nature: positive, negative, linear, quadratic, and so on. While
graphs have been used a good deal in the behavioral sciences, they, like distributions,
probably have not been used enough. To be sure, there are objective ways of epitomizing
and testing relations, such as correlation coefficients, comparison of means, and other
statistical methods, but none of these so vividly and uniquely describes a relation as a graph.¹³
Look back at the graphs in Chapter 5 (Figures 5.1, 5.4, 5.5, and 5.6). Note how they convey the nature of the relations. Later we will use graphs in a more interesting way to show the nature of rather complex relations among variables. To give the student just a taste of the richness and interest of such analysis, we anticipate later discussion; in fact, we will try to teach a complex idea through graphs.
The three graphs of Figure 9.1 show three hypothetical relations between age, as an independent variable, and verbal achievement (VA), as dependent variable, of middle-class children (A), and working-class children (B). One can call these growth graphs. The horizontal axis is the abscissa; it is used to indicate the independent variable, or X. The vertical axis is the ordinate; it is used to indicate the dependent variable, or Y. Graph (a) shows the same positive relation between age and VA with both A and B samples. It also shows that the A children exceed the B children. Graph (b), however, shows that both relations are positive, but that as time goes on the A children's achievement increases more than the B children's achievement. This seems to be the sort of phenomenon that Coleman et al. found when comparing the verbal achievement of majority- and minority-group children in grades 3, 6, 9, and 12.¹⁴ Graph (c) is more complex. It shows that the A children were superior to the B children at an early age and remained the same to a later
Figure 9.1 [Three hypothetical growth graphs, (a), (b), and (c), plotting verbal achievement against age for groups A and B.]
"Drawing graphs is today greatly expedited by the availability of computers and computer programs with
graphic capabilities. For instance, some computer programs routinely incorporate the possibility of printing
so-called scatter plots of data. The labor of drawing graphs — except in special circumstances — has virtually
been eliminated Interpretation, of course, remains a problem,
'•j. Coleman et al.. Equality of Educational Opportunity. Washington. D.C.: U.S. Government Printing
Office. 1966. See, especially, pp. 20ff, and 220ff,
age, but the B children, who started lower, advanced and continued to advance over time until they exceeded the A children. This sort of relation is unlikely with verbal achievement, but it can occur with other variables.
The phenomenon shown in graphs (b) and (c) is known as interaction. Briefly, it
means that two (or more) variables interact in their "effect" on a dependent variable. In
this case, age and group status interact in their relation to verbal achievement. Expressed
differently, interaction means that the relation of an independent variable to a dependent
variable is different in different groups, as in this case, or at different levels of another
independent variable. It will be explained in detail and more accurately when we study
analysis of variance and multiple regression analysis.
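A reader who wants to see interaction emerge can plot hypothetical growth lines like those of Figure 9.1(b); the following Python sketch (with invented means) draws them:

    # Nonparallel growth lines: the graphic signature of interaction.
    import matplotlib.pyplot as plt

    ages = [6, 8, 10, 12]
    va_A = [20, 35, 52, 70]   # hypothetical means, group A (middle class)
    va_B = [18, 28, 38, 48]   # hypothetical means, group B (working class)

    plt.plot(ages, va_A, marker="o", label="A (middle class)")
    plt.plot(ages, va_B, marker="s", label="B (working class)")
    plt.xlabel("Age (independent variable)")
    plt.ylabel("Verbal achievement (dependent variable)")
    plt.legend()
    plt.show()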
While means are one of the best ways to report complex data, complete reliance on
them can be unfortunate. Most cases of significant mean differences between groups are
also accompanied by considerable overlap of the distributions. Clear examples are given by Anastasi, who points out the necessity of paying attention to overlapping and gives examples and graphs of sex distribution differences, among others.¹⁵ In short, students of
research are advised to get into the habit, from the beginning of their study, of paying
attention to and understanding distributions of variables and to graphing relations of
variables.
Measures of Central Tendency and Variability
There is little doubt that measures of central tendency and variability are the most important tools of behavioral data analysis. Since much of this book will be preoccupied with such measures — indeed, a whole section is called "The Analysis of Variance" — we need only characterize averages and variances. The three main averages, or measures of central tendency, used in research, the mean, median, and mode, are epitomes of the sets of measures from which they are calculated. Sets of measures are too many and too complex to grasp and understand readily. They are "represented" or epitomized by measures of central tendency. They tell what sets of measures "are like" on the average. But they are also compared to test relations. Moreover, individual scores can be usefully compared to them to assess the status of the individual. We say, for instance, that individual A's score is such-and-such a distance above the mean.
While the mean is the most used average in research, and while it has desirable properties that justify its preeminent position, the median, the midmost measure of a set of measures, and the mode, the most frequent measure, can sometimes be useful in research. For instance, the median, in addition to being an important descriptive measure, can be used in tests of statistical significance where the mean is inappropriate.¹⁶ The mode is
used mostly for descriptive purposes, but it can be useful in research for studying charac-
teristics of populations and relations. Suppose that a mathematical aptitude test was given to all incoming freshmen in a college that has just initiated open admissions, and that the
distribution of scores was bimodal. Suppose, further, that only a mean was calculated,
compared to means of previous years, and found to be considerably lower. The simple
conclusion that the average mathematical aptitude of incoming freshmen was considerably
lower than in previous years conceals the fact that because of the open admissions policy
many freshmen were admitted who had deficient backgrounds in mathematics. While this
¹⁵Anastasi, op. cit., pp. 103-105, 170-174, 203-209, 237-239. Types of means and other measures of central tendency are exceptionally well discussed in: M. Tate, Statistics in Education. New York: Macmillan, 1955, chap. 11. Tate also gives a number of good examples of distributions and graphs of various kinds. Though old, this is a valuable book.
case is fairly obvious, the obscuring of important sources of differences can be more subtle. It often pays off in research, in other words, to calculate medians and modes as well as means.¹⁷
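The open-admissions example can be simulated. In the Python sketch below (scores invented), the lone mean conceals exactly what a histogram, or attention to the mode, reveals:

    # A bimodal distribution whose mean hides two distinct groups.
    import numpy as np

    rng = np.random.default_rng(9)
    prepared = rng.normal(70, 5, 300)     # adequately prepared freshmen
    deficient = rng.normal(40, 5, 200)    # deficient mathematics backgrounds
    scores = np.concatenate([prepared, deficient])

    print(f"mean = {scores.mean():.1f}, median = {np.median(scores):.1f}")
    counts, edges = np.histogram(scores, bins=12)
    print(counts)   # two humps: the bimodality the single mean conceals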
The principal measures of variability are the variance and the standard deviation. They have already been discussed and will be discussed further in later chapters. We therefore forego discussion of them here, except to say that research reports should always include variability measures. Means should almost never be reported without standard deviations (and N's, the sizes of samples), because adequate interpretation of research by readers is virtually impossible without variability indices. Another measure of variability that has in recent years become more important is the range, the difference between the highest and lowest measures of a set of measures. It has become possible, especially with small samples (with N about 20 or 15 or less), to use the range in tests of statistical significance.
Measures of Relations
[The text of this section was not recovered.]
Analysis of Differences
tested. Or one might want to know whether groups set up to be homogeneous are homogeneous on variables other than those used to form the groups.¹⁸
The second point is more important. All analysis of differences is really for the purpose of studying relations. Suppose one believes that changing art preferences toward greater complexity will transfer to music preferences and sets up three experimental groups, one of which is given the greater complexity manipulation.¹⁹ One finds the predicted differences between the means of the three groups on music preferences, with the one given the complexity manipulation the highest. It is not really these differences that interest us, however. It is the relation of the study: that between the complexity modification on art preference and the modification toward greater music complexity preference. Differences between means, then, really reflect the relation between the independent variable and the dependent variable. If there are no significant differences among means, the correlation between independent variable and dependent variable is zero. And, conversely, the greater the differences the higher the correlation, other things equal.
Suppose an experiment to study the effect of random reinforcement of opinion utterance on rate of opinion utterance has been done and the experimental group, which received random reinforcement, had a mean of 6 utterances in a specified period of time, and the control group, which received a regular rate of reinforcement, had a mean of 4 utterances.²⁰ The difference is statistically significant, and we conclude from the significant difference that there is a relation between reinforcement and opinion utterance rate.
In earlier chapters, relations between measured variables were plotted to show the nature of the relations. It is possible, too, to graph the present relation between the experimental (manipulated) independent variable and the measured dependent variable. This has been done in Figure 9.2, where the means have been plotted as indicated. While the plotting is more or less arbitrary — for instance, there are no real baseline units for the independent variable — the similarity to the earlier graphs is apparent and the basic idea of a relation is clear.
If the reader will always keep in mind that relations are sets of ordered pairs, the conceptual similarity of Figure 9.2 to earlier graphs will be evident. In the earlier graphs, each member of each pair was a score. In Figure 9.2, an ordered pair consists of an experimental treatment and a score. If we assign 1 to the experimental group and 0 to the control group, two ordered pairs might be: (1, 6), (0, 4).
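Indeed, if the coded group memberships and the scores are correlated, the resulting coefficient (a point-biserial r) expresses the difference between means as a relation. A Python sketch with invented scores:

    # Difference between means expressed as a correlation.
    import numpy as np

    rng = np.random.default_rng(10)
    group = np.array([1] * 10 + [0] * 10)        # 1 = experimental, 0 = control
    scores = np.where(group == 1,
                      rng.normal(6, 1, 20),      # experimental: mean about 6
                      rng.normal(4, 1, 20))      # control: mean about 4

    diff = scores[group == 1].mean() - scores[group == 0].mean()
    r = np.corrcoef(group, scores)[0, 1]         # point-biserial r
    print(f"mean difference = {diff:.2f}, r = {r:.2f}")
    # No mean difference would mean r near zero; larger differences,
    # other things equal, mean higher r.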
¹⁸A quick, easy, and effective test of variances is given in: E. Pearson and H. Hartley, eds., Biometrika Tables for Statisticians, vol. I. Cambridge: University of Cambridge Press, 1954, pp. 60-61 and 179. This well-known volume of statistical tables and its examples and explanations are most useful to researchers. Another useful and well-known volume of statistical tables is: R. Fisher and F. Yates, eds., Statistical Tables for Biological, Agricultural and Medical Research. New York: Hafner, 1963.
¹⁹V. Renner, "Effects of Modification of Cognitive Style on Creative Behavior," Journal of Personality-
Figure 9.2 [Graph plotting the dependent variable (mean utterance rates: 4 for the Control Group, 6 for the Experimental Group) against the independent variable (Control Group, Experimental Group).]
and some to other causes. Analysis of variance's job is to work with these different variances and sources of variance. Strictly speaking, analysis of variance is more appropriate for experimental than for nonexperimental data, even though its inventor, Fisher, used it with both kinds.²¹ We will consider it, then, a method for the analysis of data yielded by experiments in which randomization and manipulation of at least one independent variable have been used.
There is probably no better way to study research design than through an analysis of variance approach. Those proficient with the approach almost automatically think of alternative analysis of variance models when confronted with new research problems. Suppose an experienced social psychological investigator is asked to assess the differential effects of three kinds of group cohesiveness on learning. He will immediately think of a simple one-way analysis of variance, which will look like the paradigm on the left [marked (a)] of Figure 9.3. If he also thinks that cohesiveness may affect children of higher intelligence differently than children of lower intelligence, then the paradigm will look like that on the right (b).²² Clearly, analysis of variance is an important method of studying differences.
Figure 9.3 [(a) One-way paradigm: learning scores classified by Group Cohesiveness, A₁, A₂, A₃. (b) Two-way paradigm: the same classification with a second, intelligence, dimension added.]
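A one-way analysis like paradigm (a) can be sketched in Python (data invented; scipy's f_oneway carries out the F test):

    # One-way analysis of variance: learning under three cohesiveness levels.
    import numpy as np
    from scipy.stats import f_oneway

    rng = np.random.default_rng(11)
    a1 = rng.normal(50, 8, 15)   # low cohesiveness
    a2 = rng.normal(55, 8, 15)   # medium cohesiveness
    a3 = rng.normal(62, 8, 15)   # high cohesiveness

    F, p = f_oneway(a1, a2, a3)
    print(f"F = {F:.2f}, p = {p:.4f}")
    # A significant F says the three means differ beyond chance expectation.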
Profile Analysis
Profile analysis is basically the assessment of the similarities of the profiles of individuals
or groups. A profile is a set of different measures of an individual or group, each of which
is expressed in the same unit of measure. An individual's scores on a set of different tests
constitute a profile, if all scores have been converted to a common measure system, like
percentiles, ranks, and standard scores. Profiles have been used mostly for diagnostic
purposes — for instance, the profiles of scores from test batteries are used to assess and
advise high school pupils. But profile analysis is becoming increasingly important in
psychological and sociological research, as we will see later when we study, among other
things, Q methodology.
Profile analysis has special problems that require researchers' careful consideration. Similarity, for example, is not a general characteristic of persons; it is similarity only of specified characteristics or complexes of characteristics.²³ Another difficulty lies in what information one is willing to sacrifice in calculating indices of profile similarity. When one uses the product-moment coefficient of correlation — which is a profile measure — one loses level; that is, differences between means are sacrificed. This is loss of elevation. Product-moment r's take only shape into account. Further, scatter — differences in variability of profiles — is lost in the calculation of certain other kinds of profile measures. In short, information can be and is lost. The student will find excellent help and guidance with profile analysis in Nunnally's book on psychometrics, though the treatment is not elementary.²⁴
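The loss of elevation is easy to demonstrate. In this Python sketch (profiles invented), two profiles of identical shape but different level correlate perfectly, so r alone misses the difference between their means:

    # Product-moment r reflects profile shape only, not elevation.
    import numpy as np

    profile_1 = np.array([40, 55, 50, 65, 45])   # hypothetical standard scores
    profile_2 = profile_1 + 15                   # same shape, elevated 15 points

    r = np.corrcoef(profile_1, profile_2)[0, 1]
    print(f"r = {r:.2f}")                                   # 1.00: shape alone
    print(f"means: {profile_1.mean():.0f}, {profile_2.mean():.0f}")  # the lost level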
Multivariate Analysis
Perhaps the most important forms of statistical analysis, especially at the present stage of
development of the behavioral sciences, are multivariate analysis and factor analysis.
Multivariate analysis is a general term used to categorize a family of analytic methods
whose chief characteristic is the simultaneous analysis of k independent variables and m dependent variables.²⁵ If an analysis includes, for instance, four independent variables and two dependent variables, handled simultaneously, it is a multivariate analysis.
It can be argued that, of all methods of analysis, multivariate methods are the most
powerful and appropriate for scientific behavioral research. The argument to support this
statement is long and involved and would sidetrack us from our subject. Basically, it rests
on the idea that behavioral problems are almost all multivariate in nature and cannot be
solved with a bivariate (two-variable) approach —
that is, an approach that considers only
one independent and one dependent variable at a time. This has become strikingly clear in
much educational research where, for instance, the determinants of learning and achievement are complex: intelligence, motivation, social class, instruction, school and class atmosphere and organization, and so on. Evidently variables like these work with each
other, sometimes against each other, mostly in unknown ways, to affect learning and
²³L. Cronbach and G. Gleser, "Assessing Similarity Between Profiles," Psychological Bulletin, 50 (1953), 456-473 (p. 457).
²⁴J. Nunnally, Psychometric Theory, 2d ed. New York: McGraw-Hill, 1978, chap. 12.
²⁵In this book we will not be excessively concerned about the terminology used with multivariate analysis. To some, multivariate analysis includes factor analysis and other forms of analysis, like multiple regression analysis. "Multivariate" to these individuals means more than one independent variable or more than one dependent variable, or both. Others in the field use "multivariate analysis" only for the case of both multiple independent and multiple dependent variables.
achievement. In other words, to account for the complex psychological and sociological
phenomena of education requires design and analytic tools that are capable of handling the
complexity, which manifests itself above all in multiplicity of independent and dependent
variables. A similar argument can be given for psychological and sociological research.
This argument and the reality behind it impose a heavy burden on those individuals
teaching and learning research approaches and methods. It is unrealistic, even wrong, to
study and learn only an approach that is basically bivariate in conception. Multivariate
methods, however, are like the behavioral reality they try to reflect: complex and difficult
to understand. The pedagogical necessity, as far as this book is concerned, is to try to
convey the fundamentals of research thinking, design, methods, and analysis mainly
through a modified bivariate approach, to extend this approach as much as possible to
multivariate conceptions and methods, and to hope that the student will pursue matters
further after having gotten an adequate foundation.
Multiple regression, probably the single most useful form of multivariate methods, analyzes the common and separate influences of two or more independent variables on a dependent variable.²⁶ We gave an educational example above. The method is used similarly in other kinds of behavioral research. Cutright, as we saw in an earlier chapter, used multiple regression to study the effects of communication, urbanization, education, and agriculture on political development.²⁷ Lave and Seskin used it to study the influences of air pollution and social class on human mortality.²⁸ The method has been used in hundreds of studies probably because of its flexibility, power, and general applicability to many different kinds of research problems. (It also has limitations!) We can hardly ignore it, then, in this book. Fortunately, it is not too difficult to understand and to learn to use — given sufficient desire to do so.
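The mechanics, at least, are not forbidding. A Python sketch of multiple regression by ordinary least squares (variables and coefficients invented for illustration):

    # Regression of achievement on two independent variables.
    import numpy as np

    rng = np.random.default_rng(13)
    intelligence = rng.normal(100, 15, 200)
    motivation = rng.normal(50, 10, 200)
    achievement = 0.4 * intelligence + 0.6 * motivation + rng.normal(0, 8, 200)

    X = np.column_stack([np.ones(200), intelligence, motivation])  # intercept column
    b, *_ = np.linalg.lstsq(X, achievement, rcond=None)
    print(f"intercept = {b[0]:.2f}, b1 = {b[1]:.2f}, b2 = {b[2]:.2f}")
    # b1 and b2 estimate the separate influences of the two variables.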
Canonical correlation is a logical extension of multiple regression. Indeed, it is a
multiple regression method. It adds more than one dependent variable to the multiple
regression model. In other words, it handles the relations between sets of independent
variables and sets of dependent variables. As such, it is a theoretically powerful method of
analysis. It has limitations, however, that can restrict its usefulness: in the interpretation
of the results it yields and in its limited ability to test theoretical models.
Discriminant analysis is also closely related to multiple regression. Its name indicates
its purpose: to discriminate groups from one another on the basis of sets of measures. It
is also useful in assigning individuals to groups on the basis of their scores on tests. While
this explanation is not adequate, it is sufficient for now.
It is difficult at this stage to characterize, even at a superficial level, the technique
known as multivariate analysis of variance, because we have not yet studied analysis of
variance. We therefore postpone its discussion.
Factor analysis is essentially different in kind and purpose from the other multivariate methods. Its fundamental purpose is to help the researcher discover and identify the unities or dimensions, called factors, behind many measures. We now know, for example, that behind many measures of ability and intelligence lie fewer general dimensions or factors. Verbal aptitude and mathematical aptitude are two of the best known such factors. In measuring social attitudes, religious, economic, and educational factors have been found.
²⁶This statement has limitations, especially about the separate contributions of independent variables, that will be discussed in Chapter 33.
²⁷P. Cutright, "National Political Development: Measurement and Analysis," American Sociological Review, 27 (1963), 229-245.
²⁸L. Lave and E. Seskin, "Air Pollution and Human Health," Science, 169 (1970), 723-733.
The above-mentioned multivariate methods are "standard" in the sense that they are usually what is meant by "multivariate methods." There are, however, other multivariate methods of equal, even greater, importance. As I said in the Preface, it is not possible in a book of this kind to give adequate and correct technical explanations of all multivariate methods. While enormously important, for example, analysis of covariance structures and log-linear models analysis are far too complex and difficult to describe and explain adequately. Similarly, multidimensional scaling and path analysis cannot be adequately presented. What to do, then? Some of these approaches and procedures are so powerful and important — indeed, they are revolutionizing behavioral research — that a book that ignores them will be sadly deficient. The solution of the problem was also outlined in the Preface. It is worth repeating. The most common and accessible approaches — analysis of variance, multiple regression, and factor analysis — will be presented in sufficient technical detail that a motivated and diligent student can at least use them and interpret their results, with the aid, of course, of desk calculators or computers (especially programmable calculators and microcomputers). Certain other highly complex methods like analysis of covariance structures and log-linear models will be described and explained "conceptually." That is, their purpose and rationale will be explained, with generous citation and description of fictitious and actual research use. Such an approach will be used in later chapters with the following three methodologies.
Path analysis is a graphic method of studying the presumed direct and indirect influences of independent variables on each other and on dependent variables. It is a method, in other words, of portraying and testing "theories."²⁹ Perhaps its main virtue is that it makes presumed influences explicit, as can be seen by examining one or two of the path analytic examples given there. It is highly likely that path analysis will in the future be used more as an adjunct method to other more general methods rather than as a self-contained and complete analytic method. For instance, path analysis can be and should be used as an integral part of analysis of covariance structures, as we will see in a later chapter.
Analysis of covariance structures — or causal modeling, or structural equation models — is the ultimate approach to the analysis of complex data structures. It means, essentially, the analysis of the varying together of variables that are in a structure dictated by theory. For example, we can test the adequacy of theories of intelligence mentioned in earlier chapters by fitting the theories into the analysis of covariance structure framework and then testing how well they account for actual intelligence test data. The method — or rather, methodology — is an ingenious mathematical and statistical synthesis of factor analysis, multiple regression, path analysis, and psychological measurement into a single comprehensive system that can express and test complex theoretical formulations of research problems.
Log-linear models is the ultimate multivariate method — or again, methodology — of
analyzing frequency data. The above-mentioned multivariate methods are for the most
part geared to analyzing data obtained from continuous measures: test scores, attitude and
personality scale measures, measures of ecological variables, and the like. As we will see
in the next chapter, however, behavioral research data are often in the form of frequen-
²⁹F. Kerlinger and E. Pedhazur, Multiple Regression in Behavioral Research. New York: Holt, Rinehart and Winston, 1973, pp. 305ff.; E. Pedhazur, Multiple Regression in Behavioral Research: Explanation and Prediction, 2d ed. New York: Holt, Rinehart and Winston, 1982, chap. 15.
cies, mostly counts of individuals, for example, numbers of males and females, blacks and whites, teachers and non-teachers, middle- and working-class individuals, Catholics, Protestants, and Jews.
Protestants, and Jews. Log-linear analysis makes it possible to study complex combina-
tions of such nominal variables and, like analysis of covariance structures, to test theories
of the relations and influences of such variables on each other. We will briefly character-
ize the methodology in the next chapter, though space limitations and technical difficulties
will force us to limit the discussion to the basic ideas involved. We will at least see,
however, that, like analysis of covariance structures, it is one of the most powerful and
important methodological breakthroughs of the century.
INDICES
Index can be defined in two related ways. One, an index is an observable phenomenon
that is substituted for a less observable phenomenon. A thermometer, for example, gives
readings of numbers that stand for degrees of temperature. The numerals on a speedome-
ter dial indicate the speed of a vehicle. Test scores indicate achievement levels, verbal
aptitudes, degrees of anxiety, and so on.
A definition perhaps more useful to the researcher is: An index is a number that is a composite of two or more numbers. An investigator makes a series of observations, for example, and derives some single number from the measures of the observations to summarize the observations, to express them succinctly. By this definition, all sums and averages are indices: they include in a single measure more than one measure. But the definition also includes the idea of indices as composites of different measures. Coefficients of correlation are such indices. They combine different measures in a single measure or index.
There are indices of social-class status. For example, one can combine income, occupation, and place of residence to obtain a rather good index of social class. An index of cohesiveness can be obtained by asking members of a group whether they would like to stay in the group. Their responses can be combined in a single number.
Indices are most important in research. They simplify comparisons. Indeed, they enable research workers to make comparisons that otherwise cannot be made, or that can be made only with considerable difficulty. Raw data are usually much too complex to be grasped and used in mathematical and statistical manipulations. They must be reduced to manageable form. The percentage is a good example. Percentages transform raw numbers into comparable form.
Indices generally take the form of quotients: one number is divided by another number. The most useful such indices range between 0 and 1.00, or from -1.00 through 0 to +1.00. This makes them independent of numbers of cases and aids comparison from sample to sample and study to study. (They are generally expressed in decimal form.) There are two forms of quotients: ratios and proportions. A third form, the percentage, is a variation of the proportion.
A ratio is a composite of two numbers that relates one number to the other in fractional or decimal form. Any fraction, any quotient, is a ratio. Either or both the numerator and denominator of a ratio can themselves be ratios. The chief purpose and utility of a ratio is comparison. If, for example, we wished to compare the ratio of male to female high school graduates to the ratio of male to female graduates of junior high school over several years, the ratio will sometimes be less than 1.00 and sometimes greater than 1.00, since it is possible that the preponderance of one sex over the other in one year may change in another year.
Sometimes ratios give more accurate information (in a sense) than the parts of which they are composed. If one were studying the relation between educational variables and tax rate, for instance, and if one were to use actual tax rates, an erroneous notion of the relation may be obtained. This is because tax rates on property are often misleading. Some communities with high rates actually have relatively low levels of taxation. The assessed valuation of property may be low. To avoid the discrepancies between one community and another, one can calculate, for each community, the ratio of assessed valuation to true valuation. Then an adjusted tax rate, a "true" tax rate, can be calculated by multiplying the tax rate in use by this fraction. This will yield a more accurate figure to use in comparisons across communities.
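The arithmetic of this adjustment is easy to sketch in code. Here is a minimal Python illustration (not in the original text; the community names and figures are hypothetical):

    # Adjust each community's nominal property tax rate by the ratio of
    # assessed valuation to true valuation, making rates comparable.
    communities = {
        # name: (nominal_rate, assessed_valuation, true_valuation) -- hypothetical
        "Community A": (0.040, 60_000_000, 120_000_000),
        "Community B": (0.025, 90_000_000, 100_000_000),
    }

    for name, (rate, assessed, true_value) in communities.items():
        adjusted = rate * (assessed / true_value)  # the "true" tax rate
        print(f"{name}: nominal {rate:.4f}, adjusted {adjusted:.4f}")

Community A's high nominal rate (.040) becomes a lower true rate (.020) than Community B's (.0225), exactly the kind of reversal the ratio is meant to expose.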
SOCIAL INDICATORS
Indicators, although closely related to indices (indeed, they are frequently indices as defined above), form a special class of variables. Variables like income, life expectancy, fertility, quality of life, educational level (of people), and environment can be called social indicators. It is evident that these are variables. Since statistics on them are usually calculated, social indicators are both variables and statistics. Unfortunately, it is difficult to define "social indicators," and no formal attempt will be made here to do so.

Percentages should not be used with small numbers, though proportions may always be used. The reason for the restriction on percentage computation is that relatively large percentages give a sense of accuracy not really present in the data. For example, suppose 6 and 4 are two observed frequencies. To transform these frequencies into 60 percent and 40 percent is a bit absurd.
Readers should know, however, that the idea of social indicators is important and is likely
to become increasingly important in the future. Their use is expanding into all fields and
eventually they will be systematically studied from a scientific viewpoint, as well as from
a "public" and social viewpoint.
In this book we are interested in social indicators as a class of sociological and psychological variables that in the future may be useful in developing and testing scientific theories of the relations among social and psychological phenomena. Certain social indicators, for example, are now used in so-called causal modeling studies of educational and occupational achievement: social class, parents' occupation, and earnings, for example. Psychological indicators, such as perceived quality of life, or "happiness," have also been used. In general, however, there seems to have been little systematic methodological work done to categorize and study social indicators, their relations to each other, and their relations to other variables. Most of the work can be called demographic and narrowly pragmatic, in essence descriptive. Nevertheless, the field, after problems of reliability and validity are addressed and perhaps solved, is richly promising and should, within a decade, offer behavioral scientists more than such statistics as 51.2 percent of the population was female in 1976, or 54 percent of the population over 18 had nine to twelve years of education. Instead, we can expect factor analytic studies of indicators, analysis of covariance structure studies in which indicators are variables of the analyzed structures, and an increasing general use of the idea of indicators in social and psychological research. One can easily see this in educational research, where the achievement of children appears to be affected in complex ways by different kinds of variables, some of them of the social indicator kind. One of the virtues of the social indicator movement is that these influences on achievement will be more consciously and systematically used in studying and testing theories of achievement.
"R. Jaeger, "About Educational Indicators; Statistics on the Conditions and Trends in Education." In L.
Shulman, ed.. Review ofReseach in Eihiculion. vol. 6. Itasca. HI.: Peacock Publishers. 1978. chap. 7. (The title
book on social indicators is; R. Bauer, ed.. Social Indicators. Cambridge, Mass.; MIT
^''The pioneering
Press. 1966.The Jaeger article, cited earlier, is a good introduction. For sources, see; O. Duncan. Toward Social
Reporting: Ne.xt Steps. New York; Russell Sage Foundation. 1969. The following article outlines the develop-
ment of this new field; E. Sheldon and R. Parke. "Social Indicators," Science. 188 (1975). 693-699. An
important discussion of social indicators is; A. Campbell. "Subjective Measures of Weil-Being." American
Psychologist. 31 (1976), 17-124. The field is disappointing in one sense, however, because the emphasis has
1
been almost wholly on the use of indicators in descriptive research and not in scientific-explanatory research.
Principles of Analysis and Interpretation • 143
Theorists generally agree on the data of reinforcement experiments. Yet they disagree vigorously on the interpretation of the data of the experiments. Such disagreements are in part a function of theory. In a book like this we cannot labor interpretation from theoretical standpoints. We must be content with a more limited objective: the clarification of some common precepts of the interpretation of data within a particular research study or series of studies.
One of the major themes of this book is the appropriateness of methodology to the prob-
lem under investigation. The researcher usually has a choice of research designs, methods
of observation, methods of measurement, and types of analysis. All of these must be
congruent: they must fit together. One does not use, for example, an analysis appropriate
to frequencies with, say, the continuous measures yielded by an attitude scale. Most
important, the design, methods of observation, measurement, and statistical analysis must
all be appropriate to the research problem.
The researcher must ask, for example, whether his measure taps only test anxiety when the problem variable is really general anxiety. Similarly, he must ask him-
self whether his measure of achievement is valid for the research purpose. If the research
problem demands application of principles but the measure of achievement is a standard-
ized test that emphasizes factual knowledge, the interpretation of the data can be errone-
ous.
"See J. Robinson. J. Rusk, and K. Head. Measures of Political Attitudes . Ann Arbor, Michigan: Institute
for Social Research. University of Michigan, 1968, chap. 13. Survey Research Center national survey data have
been used by others. Thus, if the measurement is inadequate, the inadequacy spreads. See. for example; N. Nie.
J. Verba, and J. Petrocik. The Changing American Voter. Cambridge. Mass.; Harvard University Press, 1976.
"One of the most influential of such sweeping conclusions is contained in: P. Converse, "The Nature of
Belief Systems in Mass Publics." In D. Apter. ed.. Ideology and Discontent New York: Free Press. 1964. pp.
.
206-261 . Converse says, in effect, that the American mass public has no systematic attitude structure. He also
says that the liberal-conservative continuum [sic] is a higher-order abstraction that the man-in-the-street knows
little about. Ignoring the empirical validity of either claim, note that both conclusions were based on the analysis
of relatively few attitude items with a restricted range of content.
In other words, we face here the obvious, but too easily overlooked, fact that ade-
quacy of interpretation is dependent on each link in the methodological chain, as well as
on the appropriateness of each link to the research problem and the congruence of the links
to each other. This is clearly seen when we are faced with negative or inconclusive results.
Negative or inconclusive results are much harder to interpret than positive results. When
results are positive, when the data support the hypotheses, one interprets the data along
the lines of the theory and the reasoning behind the hypotheses. Although one carefully
asks critical questions, upheld predictions are evidence for the validity of the reasoning
behind the problem statement.
This is one of the great virtues of scientific prediction. When we predict something
and plan and execute a scheme for testing the prediction, and things turn out as we say
they will, then the adequacy of our reasoning and our execution seems supported. We are
never sure, of course. The outcome, though predicted, may be as it is for reasons quite
other than those we fondly espouse. Still, that the whole complex chain of theory, deduc-
tion from theory, design, methodology, measurement, and analysis has led to a predicted
outcome is cogent evidence for the adequacy of the whole structure. We make a complex
bet with the odds against us, so to speak. We then throw the research dice or spin the
research wheel. If our predicted number comes up, the reasoning and the execution lead-
ing to the successful prediction would seem to be adequate. If we can repeat the feat, then
the evidence of adequacy is even more convincing.
But now take the negative case. Why were the results negative? Why did the results
not come out as predicted? Note that any weak link in the research chain can cause
negative results. They can be due to any one, or several, or all of the following: incorrect
theory and hypotheses, inappropriate or incorrect methodology, inadequate or poor meas-
urement, and faulty analysis. All these must be carefully examined. All must be scruti-
nized and the negative results laid at the door of one, several, or all of them. If we can be
fairly sure that the methodology, the measurement, and the analysis are adequate, then
negative results can be definite contributions to scientific advance, since only then can we
have some confidence that our hypotheses are not correct.
The testing of hypothesized relations is strongly emphasized in this book. This does not
mean, however, that other relations in the data are not sought and tested. Quite the
contrary. Practicing researchers are always keen to seek out and study relations in their
data. The unpredicted relation may be an important key to deeper understanding of theory. It may throw light on aspects of the problem not anticipated when the problem was
formulated. Therefore researchers, while emphasizing hypothesized relations, should
always be alert to unanticipated relations in their data.
Suppose we have hypothesized that the homogeneous grouping of pupils will be
beneficial to bright pupils but not beneficial to pupils of lesser ability. The hypothesis is
upheld, say. But we notice an apparent difference between suburban and rural areas; the
relation seems stronger in the suburban areas; it is reversed in some rural areas! We
analyze the data using the suburban-rural variable. We find that homogeneous grouping
seems to have a marked influence on bright children in the suburbs, but that it has little or
no influence in rural areas. This would be an important finding indeed.
Study Suggestions
1. Suppose you wish to study the relation between social class and test anxiety. What are the
two main possibilities for analyzing the data (omitting the possibility of calculating a coefficient of
correlation)? Set up two analytic structures.
2. Assume that you want to add sex as a variable to the problem above. Set up the two kinds of
analytic paradigms.
3. Suppose an investigator has tested the effects of three methods of teaching reading on reading achievement. He had 30 subjects in each group and a reading achievement score for each subject. He also included sex as an independent variable: half the subjects were male and half female. What does his analytic paradigm look like? What goes into the cells?

G. Bower and E. Hilgard, Theories of Learning, 5th ed. New York: Appleton-Century-Crofts, 1981, chap. 15.

M. Lepper, D. Greene, and R. Nisbett, "Undermining Children's Intrinsic Interest with Extrinsic Reward: A Test of the Overjustification Hypothesis," Journal of Personality and Social Psychology, 28 (1973), 129-137. Condry has competently reviewed research on extrinsic and intrinsic motivation: J. Condry, "Enemies of Exploration: Self-Initiated versus Other-Initiated Learning," Journal of Personality and Social Psychology, 35 (1977), 459-477.
4. Study Figure 9.3. Do these analysis of variance designs or paradigms represent partitioning
of variables? Why? Why is partitioning important in setting up research designs and in analyzing
data? Do the rules of categorization (and partitioning) have any effect on the interpretation of data?
If so, what effects might they have? (Consider the effects of violations of the two basic partitioning
rules.)
Chapter 10

The Analysis of Frequencies
So far, we have talked mostly about analysis. Now we learn how to do analysis. The simplest way to analyze data to study relations is by cross-partitioning frequencies. A cross partition, as we learned in Chapter 4, is a new partitioning of the set U formed by taking all subsets of the form A ∩ B. That is, we form subsets of the form A ∩ B from the known subsets A and B of U. Examples were given in Chapter 4; more will be given shortly. The expression "cross partition" refers to an abstract process of set theory. Now, however, when the cross partition idea is applied to the analysis of frequencies to study relations between variables, we call the cross partitions crossbreaks. The kind of analysis to be shown is also called contingency analysis, or contingency table analysis.

We can no longer get along without statistics. So we introduce a form of statistical analysis commonly associated with frequencies, the χ² (chi square) test, and the idea of statistical "significance." This study of crossbreaks and χ² should help ease us into statistics.
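To make the crossbreak idea concrete, here is a minimal Python sketch (not in the original text) that forms the cross partition A ∩ B by counting raw categorical observations; the observations themselves are hypothetical:

    from collections import Counter

    # Hypothetical raw observations: one (party, vote) pair per senator.
    observations = [
        ("Republican", "Yea"), ("Republican", "Nay"), ("Democrat", "Yea"),
        ("Democrat", "Yea"), ("Republican", "Nay"), ("Democrat", "Nay"),
    ]

    # Each cell of the crossbreak is one subset of the form A ∩ B.
    cells = Counter(observations)

    parties, votes = ["Republican", "Democrat"], ["Yea", "Nay"]
    print(" " * 12 + "".join(f"{v:>6}" for v in votes))
    for p in parties:
        print(f"{p:<12}" + "".join(f"{cells[(p, v)]:>6}" for v in votes))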
Table 10.1 Relation Between Political Party Affiliation and Welfare Vote (Moynihan Amendment), U.S. Senate, 1981

                   Yea    Nay
    Republican       5     48     53
    Democrat        41      4     45
                    46     52     98
What we have called categorical variables are also called, perhaps more accurately, "nominal variables." This is because they belong to what we will later learn is the level of measurement called "nominal." Since in this and later chapters we have to be quite clear about the difference between continuous and categorical variables, let us briefly anticipate a later discussion and define measurement. When the numbers or symbols assigned to objects have no number meaning beyond presence or absence of the property or attribute being measured, the measurement is called "nominal." A variable that is nominal is, of course, what we have been calling "categorical." To name something ("nominal") is to put it into a category ("categorical").

All this is perhaps clarified by the following set equation, which is a general definition of measurement:

    f = {(x, y); x = any object, and y = any numeral}

which is read: f is a rule of correspondence that is defined as a set of ordered pairs, (x, y), such that x is some object and y is some numeral assigned to x. This is a general definition that covers all cases of measurement. Obviously, y can be a set of continuous measures or simply the set {0, 1}. Categorical or nominal variables are those variables where y = {0, 1}, 0 and 1 being assigned on the basis of the object x either possessing or not possessing some defined property or attribute. Continuous variables are those variables where y = {0, 1, 2, . . . , k}, or some numerical system where the numbers mean more or less of the attribute in question. (It is mathematically difficult to define "continuous measures," and the definition just given is not satisfactory. Nevertheless, the reader will know what is meant.)
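As a small illustration (not in the original text), the rule f can be written out directly in Python; the objects and the property are hypothetical:

    # Measurement as a set of ordered pairs (x, y): y = 1 if the object x
    # possesses the defined property, 0 if it does not.
    objects = ["Alice", "Bob", "Carol", "Dan"]
    has_property = {"Alice", "Carol"}  # hypothetical property, e.g., "is a teacher"

    f = {(x, 1 if x in has_property else 0) for x in objects}
    print(sorted(f))  # [('Alice', 1), ('Bob', 0), ('Carol', 1), ('Dan', 0)]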
The level of measurement of this chapter is mostly nominal. Even when continuous variables are used, they are converted to nominal variables. In general, this should not be done, because it throws information (variance) away. Nevertheless, there are times when it is useful or necessary to do so.
F. Kerlinger, "Research in Education." In R. Ebel, V. Noll, and R. Bauer, eds., Encyclopedia of Educational Research, 4th ed. New York: Macmillan, 1969, pp. 1127-1144 (1137).

Crossbreaks are also used in descriptive ways. The investigator may not be interested in relations as such; he may want only to describe a situation that exists. For instance, take the case where a table breaks social-class membership against possession of TV sets, refrigerators, and so on. This is a descriptive comparison rather than a variable crossbreak, even though we might conceivably call possession of a TV set, for instance, by some variable name. Our concern is exclusively with the analysis of data gathered to test or explore relations.
The main purpose of crossbreaks, then, is to study the relations between variables. But they have other side purposes. They can be used to organize data in convenient form for statistical analysis. A statistical test is then applied to the data. Indices of association, too, are readily calculated.

Another purpose of crossbreaks is to control variables. As we will see later, the relation between two variables can be studied within the categories of a third variable. In this way "spurious" relations can be unmasked and the relations between variables can be "specified," that is, differences in degree of relation at different levels of a third variable can be identified.

Two examples of crossbreaks were given above; a third is given in Table 10.3. The data are from a study of what the
authors call the New Left, or New Liberals, and the Silent Minority. The two rubrics amount to "liberal" and "conservative." Call this variable ideology. The tabled data are the numbers of respondents who felt that the responsibility for the condition of the poor was either the poor themselves or the society. Appropriate percentages (proportions) have been calculated and entered in the cells. It is clear that there is a strong relation between ideology and attribution of responsibility: the New Liberals assign responsibility more to society, whereas the Silent Minority assigns it more to the poor themselves.

Table 10.3 Relation Between Ideology and Attribution of Responsibility for Poverty

                          Responsibility Attribution
    Ideology              Poor            Society
    New Liberals          46 (.32)        97 (.68)      143
    Silent Minority       29 (.83)         6 (.17)       35
The Analysis of Frequencies • 151
/4| /1,S,
A.
152 • Analysis, Interpretation, Statistics, and Inference
Ci
Ax
A,
The Analysis of Frequencies • 153
At any rate, in all three tables we calculate the percentages across the rows, or from the independent variable (rows) to the dependent variable (columns).

To be sure we know what we are doing, let's calculate the percentages of Table 10.3. Take the rows separately. The New Liberal row: 46/143 = .32, and 97/143 = .68. These are the proportions. Multiplying by 100, which amounts to moving the decimal point two places to the right, yields, of course, 32 percent and 68 percent. Now the Silent Minority row: 29/35 = .83, and 6/35 = .17, or 83 percent and 17 percent. (Notice that each row must total 1.00, or 100 percent.) The relation is now clear: the Silent Minority attributes responsibility for poverty to the poor, whereas the New Liberals tend to attribute responsibility to society. Notice how the percentage crossbreak highlights the relation, which is not clear in the frequencies because of the unequal numbers of New Liberals (143) and Silent Minority (35). In other words, the percentage calculation transforms both rows to a common base and enhances the comparison and the relation.
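A minimal Python sketch of the row-percentage calculation (not in the original text), using the Table 10.3 frequencies:

    # Rows: categories of the independent variable (ideology).
    # Columns: categories of the dependent variable (responsibility attribution).
    table = {
        "New Liberals":    [46, 97],  # Poor, Society
        "Silent Minority": [29, 6],
    }

    for ideology, (poor, society) in table.items():
        n = poor + society  # row total: the common base for the row
        print(f"{ideology:<16} Poor {100 * poor / n:5.1f}%   "
              f"Society {100 * society / n:5.1f}%")

The output reproduces the 32/68 and 83/17 percent splits discussed above.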
The reader may wonder about two things: Why not calculate the percentages the other
way: from the dependent variable to the independent variable? Why not calculate the
percentages over the whole table? There is nothing inherently wrong with either of these
calculations. In the first case, however, we would be asking the data a different question.
In the second case, we merely transform the frequency data to percentage or proportion
data without changing the pattern of the frequencies.
The Miller-Levitin problem is pointed toward accounting for attribution of responsibility for poverty, to the poor or to society. An hypothesis implied by the problem is: Conservatives attribute responsibility for poverty to the poor. This is a statement of the if p, then q kind: If conservative, then attribution of responsibility to the poor. There can be no doubt of the independent and dependent variables. Therefore the calculation of the percentages is determined, since we must ask: Given conservatism, what proportion of responses is attribution of responsibility to the poor? The question is answered in the second row of Table 10.3: .83, or 83 percent. (The first row is of course also important in the overall relation.) If the percentages are calculated down the columns, this is tantamount to the hypothesis: If attribution of responsibility for poverty is to the poor, then conservative ideology. But we are not trying to account for ideology; ideology is not the dependent variable. Naturally, if we go ahead anyway and calculate percentages down the columns, they will be misleading. (See Study Suggestion 3.)
This is like the conditional statements of Chapter 7, whose correct statements are derived from the research problem. For example, for Table 10.1 we can say: If Republican, then vote Nay, a conditional statement. In set and probability theory language, this is the probability of B2, a Nay vote, given A1, Republican, or

    P(B2 | A1) = P(A1 ∩ B2) / P(A1) = (48/98) / (53/98) = 48/53 = .91

and this is the conditional probability: the probability of B2, given A1. It is also the percentage in the A1B2 cell of Table 10.1. For a more complete discussion, see F. Kerlinger, Foundations of Behavioral Research, 2d ed.
Look at the frequencies of Table 10.3. Do they really express a relation between
ideology and attribution of responsibility for poverty? Or could they have happened by
chance? Are they one pattern among many patterns of frequencies that one would get
picking numbers from a table of random numbers, such selection being limited only by the
given marginal frequencies? Such questions have to be asked of every set of frequency
results obtained from samples. Until they are answered, there is little or no point in going
further with data interpretation. If our results could have happened by chance, of what use
is our effort to interpret them?
What does it mean to say that an obtained result is "statistically significant," that it departs "significantly" from chance expectation? Suppose that we were to do an actual experiment 100 times, just as we toss a coin 100 times. Each experiment is like a coin toss or a throw of the dice. The outcome of each experiment can be considered a sample point. The sample space, properly conceived, is an infinite number of such experiments or sample points. For convenience, we conceive of the 100 replications of the experiment as the sample space U. This is nothing new. It is what we did with the coins and the dice.
Table 10.4

                      Approve    Disapprove
    f_o                  60           40
    f_e                  50           50
    f_o - f_e            10          -10
    (f_o - f_e)²        100          100
Suppose the observed frequencies are the approvals and disapprovals of 100 respondents, with 50/50 the chance expectation, as in Table 10.4. How often would such a large discrepancy, if it is a large discrepancy, happen by chance? The χ² test is a convenient way to get an answer.

We now write a χ² formula:

    χ² = Σ [(f_o - f_e)² / f_e]

which simply says: "Subtract each expected frequency, f_e, from the comparable obtained frequency, f_o; square this difference; divide the difference squared by the expected frequency, f_e; and then add up these quotients." This was done in Table 10.4. To make sure the reader knows what is happening, we write it out:

    χ² = (60 - 50)²/50 + (40 - 50)²/50 = 100/50 + 100/50 = 2 + 2 = 4
But what does χ² = 4 mean? χ² is a measure of the departure of obtained frequencies from the frequencies expected by chance. Provided we have some way of knowing what the chance expectations are, and provided the observations are independent, we can always calculate χ². The larger χ² is, the more the obtained frequencies deviate from the expected chance frequencies. The value of χ² ranges from 0, which indicates no departure of obtained from expected frequencies, through a large number of increasing values.
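The formula translates directly into a few lines of Python (not in the original text):

    def chi_square(observed, expected):
        """Sum of (f_o - f_e)**2 / f_e over all cells."""
        return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

    # Table 10.4: 60 approve, 40 disapprove, against a 50/50 chance expectation.
    print(chi_square([60, 40], [50, 50]))  # 4.0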
In addition to the formula above, it is necessary to know the so-called degrees of freedom of the problem and to have a χ² table. Chi square tables are found in almost any statistics book, together with instructions on how to use them. So are explanations of degrees of freedom. We may say here that "degrees of freedom" means the latitude of variation a statistical problem has. In the problem above, there is one degree of freedom because the total number of cases is fixed, 100, and as soon as one of the frequencies is given the other is immediately determined. That is, there are no degrees of freedom left when two numbers must sum to 100 and one of them, say 40, is given. Once 40, or 45, or any other number is given, there are no more places to go. The remaining number has no freedom to vary.
To understand more about what is going on here, suppose we calculate the χ² values for all possibilities: 40/60, 41/59, 42/58, . . . , 50/50, . . . , 60/40. Doing so, we get the set of values given in Table 10.5. (In reading the table, it is helpful to conceive of the first frequency of each pair as "Heads," or "Agrees with," or "Male," or any other variable.) Only two of these χ² values, the values of 4.00 associated with 40/60 and 60/40, are statistically significant. They are statistically significant because, checking the χ² table for one degree of freedom, we find an entry of 3.841 at what is called the .05 level of significance. All the other χ² values in Table 10.5 are less than 3.841. Take the χ² for 42/58, which is 2.56. If we consult the table, 2.56 falls between the values of χ² with probabilities of .10 and .25, or 2.706 and 1.323, respectively. This is actually a probability of about .14. In most cases, we do not need to bother finding out where it falls. All we need to do is to note that it does not make the .05 grade of 3.841. If it does not, we say that it is not statistically significant at the .05 level. The reader may now ask: "What is the .05 level of significance?"
Similarly, if one calculates a mean of 100 scores, one uses up one degree of freedom by imposing one restriction on the data. Probably the best explanation of degrees of freedom is Walker's. See H. Walker, "Degrees of Freedom," Journal of Educational Psychology, 31 (1940), 253-269, and H. Walker, Mathematics Essential for Elementary Statistics, rev. ed. New York: Holt, Rinehart and Winston, 1951, chap. 22.
Table 10.5 χ² Values Calculated from Pairs of Frequencies Summing to 100

    Frequencies    χ²        Frequencies    χ²
    40/60         4.00       51/49          .04
    41/59         3.24       52/48          .16
    42/58         2.56       53/47          .36
    43/57         1.96       54/46          .64
    44/56         1.44       55/45         1.00
    45/55         1.00       56/44         1.44
    46/54          .64       57/43         1.96
    47/53          .36       58/42         2.56
    48/52          .16       59/41         3.24
    49/51          .04       60/40         4.00
    50/50          .00
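The values of Table 10.5 can be regenerated with a short Python sketch (not in the original text):

    # Compare each split n/(100 - n) with the chance expectation 50/50.
    for n in range(40, 61):
        chi_sq = (n - 50) ** 2 / 50 + ((100 - n) - 50) ** 2 / 50
        print(f"{n}/{100 - n}: {chi_sq:.2f}")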
The .05 level means, roughly, the following: suppose an experiment were replicated 100 times, and suppose there is actually no difference between the means; 5 of these 100 replications might still show differences as large as the actual obtained differences.
While this discussion may help to clarify the meaning of statistical significance, it does not yet answer all the questions asked before. The .05 level was originally chosen, and has persisted with researchers, because it is considered a reasonably good gamble. It is neither too high nor too low for most social scientific research. Many researchers prefer the .01 level of significance. This is quite a high level of certainty. Indeed, it is "practical certainty." Some researchers say that the .10 level may sometimes be used. Others say that 10 chances in 100 are too many, so that they are not willing to risk a decision with such odds. Still others say that the .01 level, or 1 chance in 100, is too stringent, and that "really" promising findings may be overlooked with so severe a standard.
Should a certain level of significance be chosen and adhered to? This is a difficult question. The .05 and .01 levels have been widely advocated. There is a newer trend of thinking that advocates reporting the significance levels of all results. That is, if a result is significant at the .12 level, say, it should be reported accordingly. Some practitioners object to this practice. They say that one should make a bet and stick to it. Another school of thought advocates working with what are called "confidence intervals."⁹ In this book, the statistical "levels" approach will be used because it is simpler. For the student who does not plan to do any research, the matter is not serious. But it is emphasized that those who will engage in research should study other procedures, such as statistical estimation methods, confidence intervals, and exact probability methods.
To illustrate the calculation and use of the χ² test with crossbreaks, we now apply it to the frequency data of Table 10.1. The formula given above is used, but with crossbreak tables its application is more complicated than its use in Table 10.4. The main difference lies in calculating the expected frequencies: in a 2 × 2 table, the expected frequency of each cell must be found from the marginal totals. The expected frequencies and the observed-minus-expected differences for the welfare vote data are shown in Table 10.6. The χ² formula simply requires squaring these differences, dividing the squares by the expected frequencies, and summing the results. These calculations yield χ² = 65.1863, at one degree of freedom. (Why one degree of freedom?) Looking up the tabled χ² value at one degree of freedom at the .01 level, we read 6.635. Since our value exceeds this substantially, it can be said that χ² is statistically significant, the obtained results are probably not chance results, and the relation expressed in the table is a "real" one in the sense that it is probably not due to chance.¹⁰
⁹Most investigators say that the results are not significant if they do not make the .05 or .01 grade. For a penetrating discussion of this obviously difficult issue, which cannot be adequately discussed here, see W. Rozeboom, "The Fallacy of the Null-Hypothesis Significance Test," Psychological Bulletin, 57 (1960), 416-428. Rozeboom advocates the use of confidence intervals and the reporting of precise probability values of experimental outcomes. See also J. Nunnally, "The Place of Statistics in Psychology," Educational and Psychological Measurement, 20 (1960), 641-650. The basic idea is that, instead of categorically rejecting hypotheses if the .05 grade is not made, we say the probability is .95 that the unknown value falls between .30 and .50. Now, if the obtained empirical proportion is, say, .60, then this is evidence for the correctness of the investigator's substantive hypothesis, or, in null hypothesis language, the null hypothesis is rejected. A convenient and excellent source of these and similar problems is: R. Kirk, ed., Statistical Issues: A Reader for the Behavioral Sciences. Monterey, Calif.: Brooks/Cole, 1972, chap. 4. See, especially, essays by Chandler, by Edwards, and by Lykken.
'"Note that x~ needs a correction if /V is small. The approximate rule is that the so-called correction for
continuityis used —
it consists merely of subtracting .5 from the absolute difference between /o and/,, in the
x'^
formula before squaring —
when expected frequencies are less than 5 in 2 x 2 tables.
Table 10.6 Observed and Expected Frequencies, Welfare Vote Data of Table 10.1ᵃ

                     Yea              Nay
    Republican      5 (24.8776)     48 (28.1224)     53
    Democrat       41 (21.1224)      4 (23.8776)     45
                   46               52               98

ᵃExpected frequencies are in parentheses. Each is the cell's row total multiplied by its column total, divided by N; for the Republican-Yea cell, (53)(46)/98 = 24.8776. The observed-minus-expected difference in that cell is 5 - 24.8776 = -19.8776.
Having found χ² to be statistically significant, the researcher should then calculate C (or other measures: see footnote 11), calculate the percentages as outlined earlier, and then interpret the data using all the information.
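A minimal Python sketch (not in the original text) that computes χ² for a two-way table and the contingency coefficient, C = √(χ²/(χ² + N)), using the welfare vote frequencies:

    import math

    def table_chi_square(table):
        """Chi square for a two-way table of observed frequencies (rows as lists)."""
        row_totals = [sum(row) for row in table]
        col_totals = [sum(col) for col in zip(*table)]
        n = sum(row_totals)
        chi_sq = sum(
            (fo - row_totals[i] * col_totals[j] / n) ** 2
            / (row_totals[i] * col_totals[j] / n)
            for i, row in enumerate(table)
            for j, fo in enumerate(row)
        )
        return chi_sq, n

    chi_sq, n = table_chi_square([[5, 48], [41, 4]])  # Table 10.1 frequencies
    c = math.sqrt(chi_sq / (chi_sq + n))
    print(f"chi square = {chi_sq:.4f}, C = {c:.2f}")  # 65.1863, C = 0.63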
Tables can be characterized by the number of their variables: a one-dimensional table has one variable, a two-dimensional table has two variables, and so on. It makes no difference how many categories any single variable has; the dimensions of a table are always fixed by the number of variables. We have already considered the two-dimensional table, where two variables, one independent and one dependent, are set against each other. It is often fruitful and necessary to consider more than two variables simultaneously. Theoretically, there is no limit to the number of variables that can be considered at one time. The only limitations are practical ones: insufficient sample size and difficulty of comprehension of the relations contained in a multidimensional table.
One-Dimensional Tables
There are two kinds of one-dimensional tables. One is a "true" one-dimensional table; it is of little interest to us since it does not express a relation. Such tables occur commonly in reports of surveys and censuses, where the frequencies or percentages of the categories of single variables are reported. In such cases we have "true" one-dimensional tables. One variable only is used in the table.
Social scientists sometimes choose to report their data in tables that look one-dimen-
sional but are really two-dimensional. Consider a table reported by Child, Potter, and
Levine.¹² In this study the values expressed in third-grade children's textbooks were
content-analyzed. Table 10.7 shows the percentages of instances in which rewards were
given for various modes of acquisition. (In the original table, only the column of percent-
ages on the left was given.) The table looks one-dimensional, but it really expresses a
relation between two variables, mode of acquisition and reward.
The key point is that tables of this kind are not really one-dimensional. In Table 10.7,
one of the variables, reward, is incompletely expressed. To make this clear, simply add
another column of percentages beside those in the original table. (This has been done in
the table.) This column can be labeled "Not Rewarded." Now we have a complete
two-dimensional table, and the relation becomes obvious. (Sometimes this cannot be done
because data for "completing" the table are lacking.)
Table 10.7 Percentages of Instances in Which Rewards Were Given for Various Modes of Acquisition

    Mode of Acquisition                          % Rewarded    (% Not Rewarded)
    Effort                                           93               (7)
    Buying, Selling, Trading                         80              (20)
    Asking, Wishing, Taking What Is Offered          68              (32)
    Dominance, Aggression, Stealing, Trickery        41              (59)
¹²I. Child, E. Potter, and E. Levine, "Children's Textbooks and Personality Development: An Exploration in the Social Psychology of Education," Psychological Monographs, 60 (1946), No. 3.
Two-Dimensional Tables
Two-dimensional tables or crossbreaks have two variables, each with two or more subclasses. The simplest form of a two-dimensional table, as we have seen, is called two-by-two, or simply 2 × 2. Two-dimensional tables are by no means limited to the 2 × 2 form. In fact, there is no logical limitation on the number of subclasses that each variable can have. Let us look at a few examples of m × n tables.
Baum and Greenberg studied the effects of anticipated crowding on seating behavior. They reported the 2 × 3 crossbreak of Table 10.8. The results support their hypothesis, which was: Persons who anticipate crowding will sit in corners of rooms, as contrasted to persons who do not anticipate crowding. χ² = 39.42, which is highly significant; C = .58, a substantial relation. (The authors did not calculate a measure of association.) We see here a simple but effective method of testing the hypothesis and analyzing the data. Under the anticipation of crowding, the subjects sat in corners, as contrasted to subjects who did not anticipate crowding.
[Table 10.8 and Tables 10.9 through 10.11 are not recoverable in this copy.]
Table 10.12 Relation Between Level of Aspiration and Success in College (Fictitious Data)

               SC      NSC
    Hi LA     140       60     200
    Lo LA      60      140     200
              200      200     400
Suppose an investigator has administered level-of-aspiration measures to 400 entering college students. At the end of three years he further categorized the students on the basis of having graduated or not. Suppose the results were those shown in Table 10.12. There is evidently a relation between the variables: χ² = 64, significant at the .001 level, and C = .37.
The investigator shows the results to a colleague, a rather sour individual, who says they are questionable: if social class were brought into the picture, the relation might be quite different. He reasons that social class and level of aspiration are strongly related, and that the original relation might hold for middle-class students but not for working-class students. Disconcerted, the investigator goes back to his data and, since he luckily has indices of social class for all the subjects, he finds, when he works out the three-variable crossbreak, the results shown in Table 10.13. Inspection of the data shows that the investigator's colleague was right: the relation between level of aspiration and success in college is considerably more pronounced with middle-class students than with working-class students.
The investigator can study the relations in more depth by calculating percentages separately for the middle-class and working-class sides of Table 10.13. In this case, since the frequencies in each row of the halves of the table total 100, the frequencies are, in effect, percentages. It can be seen that the relation between level of aspiration and success in college is much stronger in the middle class than in the working class.

[Table 10.13, the three-variable crossbreak of level of aspiration (Hi LA, Lo LA) and college success (SC, NSC) within the middle-class and working-class groups, is not recoverable in this copy.]
A coefficient of correlation is an index that expresses the magnitude of a relation. A crossbreak expresses the ordered pairs in a table of frequencies.
To show how these ideas are related, take the fictitious data of Table 10.14. The relation studied is between state control of the economic system and political democracy. In a study of political democracy in modern nations, Bollen hypothesized that the greater the control of the economic system of a country, the lower its level of political democracy. Suppose that of a sample of 23 countries, we count 12 countries with low economic control (Low EC) and 11 countries with high economic control (High EC). We also count 13 countries with high political development (High PD) and 10 countries with low political development (Low PD). This gives us the marginal totals of a 2 × 2 crossbreak. It does not tell us how many countries are in each of the cells, however.
We now count the number of Low EC countries that have High PD and the number of High EC countries that have Low PD. These counts are entered in the appropriate cells of the 2 × 2 crossbreak of Table 10.14. We find that the cell frequencies depart significantly from chance expectation. There is thus a significant relation between state economic control and political development.

Table 10.14 Relation Between State Economic Control and Political Development (Fictitious Data)

                          B1 (1): High PD    B2 (0): Low PD
    A1 (1): Low EC               10                 2           12
    A2 (0): High EC               3                 8           11
                                 13                10           23
So that we can see the ordered pairs clearly, let's change the variable notation. Let A1 = Low EC, A2 = High EC, B1 = High PD, and B2 = Low PD. The A and B labels have been appropriately inserted in Table 10.14. Now, how do we set up the ordered pairs of the crossbreak? We do so by assigning each of the 23 countries one of the following subset combinations: (1,1), (1,0), (0,1), (0,0). (See the designations in Table 10.14.) In other words, A1 and B1 are assigned 1's, and A2 and B2 are assigned 0's. If a country has Low EC and High PD, then it is A1B1; consequently the ordered pair assigned to it is (1,1). The first 10 countries of Table 10.15 belong to the A1B1 category and are thus assigned (1,1). Similarly, the remaining countries are assigned ordered pairs of numbers according to their subset membership. The full list of 23 ordered pairs is given in Table 10.15. The categories or crossbreak (set) intersections have been indicated.
Table 10.15 Ordered Pairs of the 23 Countries of Table 10.14

    Countries     Ordered Pair    Category
    1-10             (1,1)          A1B1
    11-12            (1,0)          A1B2
    13-15            (0,1)          A2B1
    16-23            (0,0)          A2B2
Figure 10.4 [A graph of the 23 ordered pairs on two axes at right angles: A (horizontal), with categories A1(1) and A2(0), and B (vertical), with categories B1(1) and B2(0). A "relation" line runs through the two larger clusters of pairs.]
If we calculate a coefficient of correlation, the product-moment r, of the Table 10.15 data, we obtain .56. (The product-moment r calculated with 1's and 0's is called a phi (φ) coefficient.²⁰)
Graph the relation. Let there be two axes, A and B, at right angles to each other, and let A and B represent the two variables of Tables 10.14 and 10.15. We are interested in studying the relation between A and B. Figure 10.4 shows the graphed ordered pairs. It also shows a "relation" line run through the larger clusters of pairs. Where is the relation? We ask: Is there a set of ordered pairs that defines a significant relation between A and B? We have paired each individual's score on A with his "score" on B and plotted the pairs on the A and B axes. Going back to the substance of the relation, we pair each individual country's "score" on economic control with its "score" on political development. In this manner we obtain a set of ordered pairs, and this set is a relation. Our real question, however, is not: Is there a relation between A and B? but rather: What is the nature of the relation between A and B?
We can see from Figure 10.4 that the relation between A and B is fairly strong. This is determined by the ordered pairs: the pairs are mostly (a1, b1) and (a2, b2). There are comparatively few (a1, b2) and (a2, b1) pairs. In words, low economic control scores pair with high political development scores (1,1), and high economic control scores pair with low political development scores (0,0), with comparatively few exceptions (5 cases out of 23). We cannot name this relation succinctly, as we can such relations as "marriage," "brotherhood," and the like. We might, however, call it "state economic control-political development," meaning that there is a relation of these variables in the ordered-pair sense.

²⁰This is not a recommended procedure. It is used here to help clarify analytic procedures and not to illustrate how φ is calculated.
ADDENDUM
Multivariate Analysis of Frequency Data: Log-Linear Models
Most of the above discussion was limited to two variables, an independent variable and a dependent variable. Many frequency data analyses, however, are of three and more variables. A fictitious example with three variables was given earlier in Table 10.13. While most three-variable cases can be analyzed and interpreted using percentages, data with four or more variables are not so amenable to analysis and interpretation. Another approach is needed. Even with three variables another approach is often needed because the data are too complex and subtle for simple interpretation. With a two-variable crossbreak there is only one relation: between A and B. With three variables, however, there are four relations of possible interest: AB, AC, BC, and ABC. The three two-variable crossbreaks are the kind we have been studying. The one three-variable crossbreak, ABC, is like that shown in Table 10.13, and in this case can be viewed most fruitfully as a study of the relation between level of aspiration and success in college in two samples: middle class and working class. That is, we study whether the relation between level of aspiration and college success is the same in the middle class as it is in the working class. If it is the same, we have "established" an invariance. If it is different, however, we have an interaction: the relation is such-and-such in the middle class, but it is so-and-so in the working class.
In the last decade or so, remarkable changes in the conceptualization of research problems and in data analysis have taken place. Before the development of multivariate analysis of both continuous measures and frequencies, analysis, and the conceptualization of analysis, was mostly bivariate. Investigators studied the relations between pairs of variables, as we have pretty much done in Chapter 10. While the idea of studying the operation of several variables simultaneously was well known, the practical means of doing so had to wait for both the computer and a different way of thinking. Later in this book we will examine the nature of the computer and its important role in research. We want now to explore, briefly and only in an introductory way, part of the newer kind of thinking in behavioral research in relation to frequency analysis.
Doing research is in effect setting up models of what "reality" is supposed to be and then testing the models against empirical data. The trouble is that the world in which influences and variables operate is almost always complex, and scientists are always limited to aspects of this complexity. The whole "reality" of anything is forever beyond reach. Perhaps we can only rarely say that A is related to B in all times and circumstances. Indeed, the variables A and B are themselves complex. We have already seen this. Intelligence, achievement, level of aspiration, social class, and political development, for example, are all complex ideas that reflect the natural complexity of the behavioral world. Now add to this individual variable complexity the additional complexity of the relations among the variables, and one wonders how it is possible to study the complexity and actually advance knowledge. Indeed, in the face of these difficulties it is remarkable that science has been as successful as it has been.
In the contemporary approach to multivariate data, we test data to determine what model fits the data. A model is an abstract outline specifying hypothesized relations in a set of data. Let's use the data and the paradigm of Table 10.13 again. Let S = success (in college), L = level of aspiration, and C = social class, the three variables of the table. If I believe that the relation between level of aspiration and success in college is the same in the middle class and in the working class, then I am hypothesizing a certain complex outcome; I am specifying a model. For example, if I believe that level of aspiration (L) "determines" or "influences" success in college (S), without regard to social class (C), I specify the model:

    LS

This means that L and S are related, and that they have the same relation in both social classes. It also means that social class and success in college are not related and that social class and level of aspiration are not related. If, however, I believe that all three variables are implicated, I specify the model:

    SLC
This means that success in college (S) and level of aspiration (L) are related, but that they are somehow related differently in the two social classes (C). The difference may take one of several forms. One, the relation (SL) may exist in both social classes, but it may be stronger in one social class than in the other, as in Table 10.13. Two, the relation may exist in one social class but not in the other. And three, the relation may take a certain form (positive, for example) in one social class but a different form (negative, for example) in the other social class. The SLC means that all three variables have to be taken into account in order for the model to be satisfactory (fit the data).
This should be sufficient discussion for a preliminary conceptual idea of an elementary
multivariate approach to frequency data. A crucial idea in contemporary methodological
approaches to frequency data analysis is that of "fit." One proposes a model that springs
from a theory, or that somehow seems reasonable. One sets up the model and, through
appropriate analysis, one generates data consistent with the model. Then one compares
the generated data to the actual data. If they are alike, then the model fits. If they are
different, then the model does not fit. It so happens that the only model that fits the data of
Table 10.13 is that just given: SLC. And study of the table (and appropriate computer
output) shows that the first possibility outlined in the preceding paragraph seems "cor-
rect": the relation between level of aspiration and success in college "exists" in both the
middle class and the working class, and it is stronger in the middle class than it is in the
working class.
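The logic of fit testing can be sketched in Python (not in the original text). As a simplified stand-in for a full log-linear analysis, this sketch fits the two-variable independence model: it generates the frequencies the model implies and measures, with χ², how far the observed data depart from them. The frequencies are hypothetical:

    def fit_independence(observed):
        """Expected frequencies under 'rows and columns are unrelated,'
        plus the chi-square badness of fit of that model."""
        row_totals = [sum(row) for row in observed]
        col_totals = [sum(col) for col in zip(*observed)]
        n = sum(row_totals)
        expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
        chi_sq = sum(
            (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
            for i in range(len(observed))
            for j in range(len(observed[0]))
        )
        return expected, chi_sq

    # A small chi square: the generated and actual data are alike, so the
    # model fits. A large one: the model does not fit.
    expected, chi_sq = fit_independence([[30, 20], [15, 35]])
    print(round(chi_sq, 2))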
When crossbreak data tables have more than three variables, they become difficult to analyze because of the many possible relations. With two variables, there is only the relation AB to examine, as was pointed out earlier. With three variables, the possible relations are AB, AC, BC, and ABC: three first-order relations and one second-order relation. With four variables, there are many more, 11 in all. AB, AC, AD, BC, BD, and CD are the first-order possibilities; ABC, ABD, ACD, and BCD are the second-order possible relations; ABCD is the third-order possibility. (Each of these would have a table like Table 10.13 or larger.) Obviously frequency data studies with four variables can be very complex. With five or more variables, the numbers of relations become so high as to be virtually unmanageable. Fortunately, problems in the published literature rarely have more than four variables. This is in sharp contrast to studies with continuous measures and continuous measure analysis, where five and more variables are common.
As said earlier, another approach and another mode of analysis are needed. There are several such approaches and modes of analysis, but the most useful and most used is called log-linear analysis. Without more background it is not possible to describe and explain this approach adequately. We can try, however, to take the mystery out of the name and to characterize the approach. Recall that to calculate χ² it was necessary to calculate expected frequencies. Take an easy example from a Senate vote on the budget reported in the New York Times.²³ This vote is given in Table 10.16, together with calculations of expected frequencies. The frequency of the first cell to be expected on the
²²In doing an actual analysis, this model is not stated correctly in the sense that it does not follow conventional log-linear rules. It is, however, good enough for our pedagogical purpose.

²³New York Times, May 13, 1981. The title of the article was "Compliant but Reluctant Democrats."
basis of chance is 46/98ths of 78, or (46)(78)/98 = 36.612. The expected frequencies for the other three cells are calculated similarly (see footnote a of Table 10.16). It is possible to do an analysis based on such multiplicative calculations.
Table 10.16 Senate Vote on the Budget, by Political Party, 1981ᵃ

                    For              Against
    Democrat       28 (36.612)      18 (9.388)       46
    Republican     50 (41.388)       2 (10.612)      52
                   78               20               98

ᵃExpected frequencies are in parentheses: each is the cell's row total multiplied by its column total, divided by N.
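The multiplicative expected-frequency calculation of Table 10.16 in a minimal Python sketch (not in the original text):

    # Expected cell frequency under chance: (row total x column total) / N.
    observed = {"Democrat": [28, 18], "Republican": [50, 2]}  # For, Against

    col_totals = [sum(col) for col in zip(*observed.values())]
    n = sum(col_totals)
    for party, row in observed.items():
        row_total = sum(row)
        expected = [row_total * ct / n for ct in col_totals]
        print(party, [round(e, 3) for e in expected])
    # Democrat [36.612, 9.388] -- the (46)(78)/98 = 36.612 of the text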
Log-linear analysis, however, works with the logarithms of the frequencies: taking logarithms turns such multiplicative relations into additive, linear ones, which is the source of the name "log-linear." For three variables A, B, and C, the model with all possible relations present (the saturated model) can be written:

    ln F = u + u_A + u_B + u_C + u_AB + u_AC + u_BC + u_ABC    (10.3)

This states that, to account for the data, all possible relations of A, B, and C must be taken into account.²⁵ Strictly speaking, this is not literally true, because some of the terms equalled zero. For instance, the relation between social class and college success and the differences between the marginal frequencies were all zero. So a more accurate way to write the model would be:

    ln F = u + u_LS + u_LSC    (10.4)

which means, in effect, that level of aspiration and college success are related, and that they are differently related in the middle class and the working class (see Table 10.13). The rules of log-linear analysis, however, require the specification of all the terms that "precede" the highest-order term, u_ABC.
In log-linear analysis, the u's of whatever model is specified (hypothesized) are calculated and called "effects." If they are significantly different from zero, they indicate relations (except u_A, u_B, and u_C, which only express the differences in marginal frequencies). The point is that log-linear analysis shows where relations "exist" and gives estimates of their magnitude and their statistical significance. Perhaps more important, log-linear analysis provides a method for telling which model (or models) among the many possible in complex frequency tables fits the empirical data. For the data of Table 10.13, for example, the only model that fits the data is the one expressed in Equation 10.3 (or 10.4), which means essentially what we expressed earlier in words: there is a second-order interaction in the data, or level of aspiration is related to college success more strongly in the middle class than in the working class.
The reader should not feel chagrined if he has had trouble following the above discussion. Log-linear analysis, while powerful and highly useful, is a complex method that requires careful and deep study. The discussion of some of its characteristics was attempted here to introduce the highly important ideas of model and testing for fit, to apprise the reader of log-linear analysis, one of the most important contemporary methodological developments, and to widen our knowledge of multivariate analysis and multivariate research problems.
Study Suggestions

1. Suppose an investigator has tested the hypothesis that subjects induced to lie will comply with a related request more than subjects not so induced, with the following (fictitious) results:

                  Experimental    Control
                  (Lie)           (Not Lie)
    Comply          20              11
    Not Comply      11              20

Calculate χ², C, and percentages. Interpret the results. Is the hypothesis supported? Is the relation weak, moderate, strong? (Answers: χ² = 5.23 (p < .05); C = .28. Yes, the hypothesis is supported. The relation is weak to moderate.)

²⁵Users of log-linear analysis, however, would usually write the model simply as ABC; the other terms are implied.
2. On June 10, 1981, the U.S. Senate voted on requiring households to pay in part for food stamps. The vote was:

                  For    Against
    Republican     26      27
    Democrat        7      39

Calculate χ², C, and percentages. Interpret the results. (Answers: χ² = 12.69; C = .34.)
3. In the important Kerner Commission Report on civil disorders (riots) in the United States, a large number of relations are presented in table form. Here is one of the tables, from a Newark survey, which gives the percentages of responses to the question, "Sometimes I hate white people," given by rioters (R) and people not involved (NI). [The table and Study Suggestion 4 are not recoverable in this copy; Study Suggestion 5 survives only in part.]

C = .61, and r = .77. It is suggested that readers study Everitt's or Reynolds' extended treatment and then calculate χ², C, and r for the 2 × 2 tables of this chapter. For the tables with more than two classifications, calculate χ² and C. The advanced student should study newer approaches and measures, like Goodman and Kruskal's lambda, the odds ratio, and the log odds ratio.
6. Except for large tables, the calculation of χ², C, and other coefficients of association is not difficult and can be done with good small calculators. For those students who have programmable calculators or microcomputers, it is fairly easy to write programs that will calculate χ², C, and other indices. Some machines (e.g., Hewlett-Packard HP-67 and HP-41CV) have accompanying software (programs) that do the calculations. Virtually any computer installation will have programs for contingency table analysis. One of the easiest to use is the widely available Minitab set of programs, which has a program, Tables, that calculates χ² for various tables. Unfortunately, Minitab calculates no indices of association. (Note, however, that if you have χ², you can easily calculate C and r with a hand calculator.) Log-linear models can be calculated "by hand," but such calculations are often difficult and risky. The Goodman computer program ECTA is not hard to use and is efficient. The BMDP computer programs BMDP3F (1977 and 1979 manuals) and BMDP4F (1981 manual)³⁰ are highly sophisticated programs that are harder to use than ECTA but that yield more analytic results.
7. Have occupations of women changed under the impact of the women's liberation move-
ment? Here are interesting data from a U.S. Census Report (in thousands);"
1970 1977
Male Female Male Female
Professional,
Managerial, 12,005 5,637 15,261 7,943
Administrative
Clerical, Sales
Service 10.413 16.489 11,213 21,471
{Note: The above figures were obtained by adding the categories Professional -I- Managerial +
Administrative; Clerical -I- Sales + Service.)
(a) Calculate percentages. (Be careful: calculate from the independent variables to the dependent variable, as usual.)
(b) Calculate χ² and C for 1970 and 1977 separately. (Use the above figures, i.e., neglect the fact that the figures indicate thousands. This affects χ² but not C.)
(c) Interpret the results of your calculations. (Be circumspect. The method of adding the category numbers may have been biased or even incorrect.)
(d) In (b), above, you calculated χ² and C using the tabled frequencies as they are. Now do the same calculations using the numbers in the thousands, i.e., instead of 12,005, for instance, use 12,005,000. Note the enormous increase in χ²; C, however, is the same. Here is a generalization: with very large numbers, virtually everything is statistically significant. This is one reason for measures of association that are unaffected by the magnitude of the numbers.
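The generalization in (d) can be checked directly. The sketch below (Python again, purely illustrative) multiplies every cell of the 1970 table by 1000 and shows that χ² grows a thousandfold while C does not move.

```python
import math

def chi2_and_c(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    chi2 = sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(len(table)) for j in range(len(table[0])))
    return chi2, math.sqrt(chi2 / (chi2 + n))

small = [[12005, 5637], [10413, 16489]]           # the 1970 figures, in thousands
big = [[x * 1000 for x in row] for row in small]  # the same figures, in units
print(chi2_and_c(small))   # chi-square on the "thousands" scale
print(chi2_and_c(big))     # chi-square 1000 times larger; C identical
```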
'"W. Dixon and M. Brown, eds., BMDP-77: Biomedical Computer Programs, P-Series. Berkeley: Univer-
sityof California Press, 1977. pp. 297-332; W. Dixon, ed.. BMDP Statistical Sofrnare. Berkeley: University of
California Press. 1981. pp. 143-206. The 1977 program is excellent and. with application and study, readily
used. The 1981 4F program is an extensive revision of the earlier program and. while also excellent, is harder to
use because so many possibilities and options have been incorporated into one program. Although praised by
some, the trend toward highly complex programs that do "everything" is perhaps unfortunate because it puts
programs beyond the understanding of many users and makes their intelligent and discerning use more difficult.
(Advice: Don't try to use log-linear analysis and programs without deep and long study. This is a powerful methodology; it is also highly complex and difficult.)
"U.S. Dept. of Commerce. Bureau of the Census. Social and Economic Characteristics of the Metropolitan
and Nonmelropolilcm Population: 1977 and 1970. Washington. D.C.: U.S. Government Printing Office. 1978,
pp. 74-79. (Current Population Reports, Special Studies P-23. No, 75.) (The efficient citation of government
publications is an arcane art. For some reason such publications have long and complex sources and names.)
8. Here are some interesting frequency data obtained in a now well-known study by the Survey Research Center, University of Michigan, on the determinants of feelings of political efficacy. The 1,223 subjects of the crossbreak were a random sample of the people of the United States. In this case the relation was between education and feelings of political efficacy among men and women.¹²

                          Political Efficacy
   Men   Grade School      48      46      56
         High School      126      77      64
         College           96      14       6
COMPUTATIONAL ADDENDUM

The results you obtain from your calculations may not agree exactly with those of the text. This is often because of so-called errors of rounding. In calculations with fractional numbers (for instance, 1234.567, 482.791, and the like) it is often necessary to round off results to, say, two decimal places. Multiply 1234.567 by 482.791; the result is 596,037.8365. Round this product to two decimal places: 596,037.84. Now imagine hundreds of such multiplications, and you will realize that obtained products will not be completely accurate because small errors cumulate. Large computers with large memories and processors that work with very large numbers give the most accurate results. Results obtained with microcomputers and handheld calculators, on the other hand, are much less accurate. For example, the χ² calculated with the data of Table 10.6 was reported as 65.1864. This was calculated with a handheld calculator (but a highly accurate one). The χ² obtained with a microcomputer of greater capacity and accuracy, however, was 65.1860.
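A few lines of code make the cumulating of rounding errors concrete. The sketch below (an arbitrary made-up computation, not the Table 10.6 calculation) carries one running total at full machine precision and another in which every intermediate product is rounded to two decimal places; the totals drift apart.

```python
total_full = 0.0
total_rounded = 0.0
for i in range(1, 1001):
    product = 1234.567 * (i / 7.0)      # arbitrary fractional multiplications
    total_full += product
    total_rounded += round(product, 2)  # round each intermediate result
print(total_full - total_rounded)       # a small but nonzero cumulated error
```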
¹²A. Campbell, P. Converse, W. Miller, and D. Stokes, The American Voter. New York: Wiley, 1960. The above tabled frequencies were calculated from the authors' Table 17-9, p. 491, in which percentages were reported. (This table included another variable: South. The above data were from the non-South.)
Chapter 11

Statistics: Purpose, Approach, Method
chance expectation as their hypothesis and to try to fit empirical data to the chance model. If the empirical data "fit" the chance model, then it is said that they are "not significant." If they do not fit the chance model, if they depart "sufficiently" from the chance model, it is said that they are "significant."
This and several succeeding chapters are devoted to the statistical approach to research
problems. In this chapter we extend the discussion of Chapter 7 on probability to basic
conceptions of the mean, variance, and standard deviation. The so-called law of large
numbers and the normal probability curve are also explained and interpreted, and some
idea is given of their potent use in statistics. In the next chapter we tackle the idea of
statistical testing itself. These two chapters are the foundation.
Four purposes of statistics are suggested in this definition. The first is the commonest and most traditional: to reduce large quantities of data to manageable and understandable form. It is impossible to digest 100 scores, for instance, but if a mean and a standard deviation are calculated, the scores can be readily interpreted by a trained person. The definition of "statistic" stems from this traditional usage and purpose of statistics. A statistic is a measure calculated from a sample. A statistic contrasts with a parameter, which is a population value. If, in U, a population or universe, we calculate the mean, this is a parameter. Now take a subset (sample) A of U. The mean of A is a statistic. For our purpose, parameters are of theoretical interest only. They are not usually known. They are estimated with statistics. Thus we deal mostly with sample or subset statistics. These samples are usually conceived to be representative of U. Statistics, then, are epitomes or summaries of the samples, and often, presumably, of the populations, from which they are calculated. Means, medians, variances, standard deviations, percentiles, percentages, and so on, calculated from samples, are statistics.
A second purpose of statistics is to aid in the study of populations and samples. This
use of statistics is so well known that it will not be discussed. Besides, we studied
something of populations and samples in earlier chapters.
A third purpose of statistics is to aid in decision making. If an educational psycholo-
gist needs to know which of three methods of instruction promotes the most learning with
the least cost, he can use statistics to help him gain this knowledge. This use of statistics is
comparatively recent.
Although most decision situations are more complex, we use an example that is quite familiar by now. A decision-maker dice gambler would first lay out the outcomes for dice throws. These are, of course, 2 through 12. He notes the differing frequencies of the numbers. For example, 2 and 12 will probably occur much less often than 7 or 6. He calculates the probabilities for the various outcomes. Finally, on the basis of how much money he can expect to make, he devises a betting system. He decides, for instance, that, since 7 has a probability of 1/6, he will require that his opponent give him, say, odds of 5 to 1 instead of even money on the first throw. (We here take liberties with craps.) To make this whole thing a bit more dramatic, suppose that two players operate with different decision-makers.¹ One player, A, proposes the following game: A will win if 2, 3, or 4 turns up, and B will win if 5, 6, or 7 turns up, the game being based on the assumption that 2, 3, 4, 5, 6, and 7 are equiprobable. B should have a good time with this game.

¹This example was suggested by I. Bross, Design for Decision. New York: Macmillan, 1953, p. 28.
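The flaw in A's proposal is easy to exhibit by enumeration. Here is a short sketch (in Python, purely for illustration) that counts the 36 equally likely outcomes of two dice:

```python
from collections import Counter

# Frequencies of the sums 2 through 12 over the 36 equally likely throws
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
print(counts[7] / 36)                          # 1/6, the probability of a 7
print(sum(counts[s] for s in (2, 3, 4)) / 36)  # A's winning outcomes:  6/36
print(sum(counts[s] for s in (5, 6, 7)) / 36)  # B's winning outcomes: 15/36
```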
The fourth and last purpose of statistics, to aid in making reliable inferences from observational data, is closely allied to, indeed is part of, the purpose of helping to make decisions among hypotheses. An inference is a proposition or generalization derived by reasoning from other propositions, or from evidence. Generally speaking, an inference is a conclusion arrived at through reasoning. In statistics, a number of inferences may be drawn from tests of statistical hypotheses. We "conclude" that methods A and B really differ. We conclude from evidence, say r = .67, that two variables are really related.
Statistical inferences have two characteristics. One, the inferences are usually made from samples to populations. When we say that the variables A and B are related because the statistical evidence is r = .67, we are inferring that because r = .67 in this sample it is r = .67, or near .67, in the population from which the sample was drawn. The second kind of inference is used when investigators are not interested in the populations, or only secondarily interested in them. An educational investigator is studying the presumed effect of the relationships between board of education members and chief educational administrators, on the one hand, and teacher morale, on the other hand. His hypothesis is that, when relationships between boards and chief administrators are strained, teacher morale is lower than otherwise. He is interested only in testing this hypothesis in Y County. He makes the study and obtains statistical results that support the hypothesis; for example, morale is lower in system A than in systems B and C. He infers, from the statistical evidence of a difference between system A, on the one hand, and systems B and C, on the other hand, that his hypothetical proposition is correct in Y County. And it is possible for his interest to be limited strictly to Y County.
To summarize much of the above discussion, the purposes of statistics can be reduced
to one major purpose: to aid in inference-making. This is one of the basic purposes of
research design, methodology, and statistics. Scientists want to draw inferences from
data. Statistics, through its power to reduce data to manageable forms (statistics) and its
power to study and analyze variances, enables scientists to attach probability estimates to
the inferences they draw from data. Statistics says, in effect, "The inference you have
drawn is correct at such-and-such a level of significance. You may act as though your
hypothesis were true, remembering that there is such-and-such a probability that it is
untrue." It should be reasonably clear why some contemporary statisticians call statistics
the discipline of decision making under uncertainty. It should also be reasonably clear
that, whether you know it or not, you are always making inferences, attaching probabili-
ties to various outcomes or hypotheses, and making decisions on the basis of statistical
reasoning. Statistics, using probability theory and mathematics, makes the process more
systematic and objective.
BINOMIAL STATISTICS

When things are counted, the number system used is simple and useful. Whenever objects are counted, they are counted on the basis of some criterion, some variable or attribute, in research language. Many examples have already been given: heads, tails, numbers on dice, sex, aggressive acts, political preference, and so on. If a person or a thing possesses the attribute, the person or thing is "counted in," we say. When something is "counted in" because it possesses the attribute in question, it is assigned a 1. If it does not possess the attribute, it is assigned a 0. The mean of such outcomes is defined:
M = Σ[X · w(X)]     (11.1)

where w(X) is the weight assigned to an X; w(X) simply means the probability each X has of occurring. The formula says: multiply each X, each score, by its weight (probability), and then add them all up. Notice that if all X's are equally probable, this formula is the same as ΣX/n.
The mean of the set {1, 2, 3, 4, 5} is

M = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3

By Formula 11.1 it is, of course, the same, but its computation looks different:

M = 1 · 1/5 + 2 · 1/5 + 3 · 1/5 + 4 · 1/5 + 5 · 1/5 = 3
Why the hair-splitting? Let a coin be tossed. U = {H, T}. The mean number of heads is, by Equation 11.1,

M = 1 · 1/2 + 0 · 1/2 = 1/2

Let two coins be tossed. U = {HH, HT, TH, TT}. The mean number of heads, or the expectation of heads, is

M = 2 · 1/4 + 1 · 1/4 + 1 · 1/4 + 0 · 1/4 = 4/4 = 1
This means that if two coins are tossed many times, the average number of heads per toss is 1. If we sample one person from 30 men and 70 women, the mean of men is: M = 3/10 · 1 + 7/10 · 0 = .3. The mean for women is: M = 3/10 · 0 + 7/10 · 1 = .7. These are the means for one outcome. (This is a little like saying "an average of 2.5 children per family.")
What has been said in these examples is that the mean of any single experiment (a single coin toss, a sample of one person) is the probability of the occurrence of one of two possible outcomes (heads, a man) which, if the outcome occurs, is assigned a 1 and, if it does not occur, a 0. For the two coins tossed together, again,

M = 1/4 · 2 + 1/4 · 1 + 1/4 · 1 + 1/4 · 0 = 1
Can we arrive at the same result in an easier manner? Yes. Just add the means for each outcome. The mean of the outcome of one coin toss is 1/2. For two coin tosses it is 1/2 + 1/2 = 1. To assign probabilities with one coin toss, we weight 1 (heads) with its probability and 0 (tails) with its probability. This gives M = p · 1 + (1 - p) · 0 = p.
Take the men-women sampling problem. Let p = the probability of a man's being sampled on a single outcome and 1 - p = q = the probability of a woman's being sampled on a single outcome. Then p = 3/10 and q = 7/10. We are interested in the mean of a man being sampled. Since M = p · 1 + q · 0 = p, M = 3/10 · 1 + 7/10 · 0 = 3/10 = p. The mean is 3/10 and the probability is 3/10. Evidently M = p, or the mean is equal to the probability.
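Formula 11.1 is a one-line computation. The sketch below (Python, illustrative only) defines the probability-weighted mean and checks the three results just derived:

```python
def weighted_mean(outcomes, weights):
    """Formula 11.1: multiply each X by its probability w(X) and sum."""
    return sum(x * w for x, w in zip(outcomes, weights))

print(weighted_mean([1, 2, 3, 4, 5], [1/5] * 5))  # 3.0, the same as sum/n
print(weighted_mean([1, 0], [1/2, 1/2]))          # one coin: mean heads = .5
print(weighted_mean([1, 0], [3/10, 7/10]))        # men-women problem: M = p = .3
```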
How about a series of outcomes? We write S_n for the sum of n outcomes. One example, the tossing of two coins, was given above. Let us take the men-women sampling problem. The mean of a man's occurring is 3/10 and of a woman's occurring 7/10. We sample 10 persons. What is the mean number of men? Put differently, what is the expectation of men? If we sum the 10 means of the individual outcomes, we get the answer:

M(S_10) = M_1 + M_2 + ··· + M_10     (11.2)
        = 3/10 + 3/10 + ··· + 3/10 = 30/10 = 3

In a sample of 10, then, we expect to get 3 men. The same result could have been obtained by 3/10 · 10 = 3. But 3/10 · 10 is pn, or

M(S_n) = pn     (11.3)

In n trials, the mean number of occurrences of the outcome associated with p is pn.
THE VARIANCE

Recall that in Chapter 6 the variance was defined as V = Σx²/n. Of course, it will be the same in this chapter, with a change in symbols (for the same reason given with the formula for the mean):

V = Σ[x² · w(X)]

To make clear what a variance and a standard deviation are in probability theory, we work two examples. Recall that, binomially, only two outcomes are possible: 1 and 0. Therefore X is equal to 1 or 0. Setting up a table to calculate the variance of the heads outcome of a coin throw gives, with p = q = 1/2, V = pq = 1/4. The variance of the sum of n outcomes is, analogously to Equations 11.2 and 11.3, the sum of the individual outcome variances, or

V(S_n) = npq     (11.7)
SD(S_n) = √(npq)     (11.8)
Equations 11.3, 11.7, and 11.8 are important and useful. They can be applied in many statistical situations. Take two or three applications of the formulas: first, the Agree-Disagree problem of the last chapter. Since a sample of 100 was taken, n = 100. On the assumption of equiprobability, p = 1/2 and q = 1/2. Therefore, M(S_100) = np = 100 · 1/2 = 50, V(S_100) = npq = 100 · 1/2 · 1/2 = 25, and SD(S_100) = √25 = 5. It was found that there were 60 Agrees. So this is a deviation of two standard deviations from the mean of 50: 60 - 50 = 10, and 10/5 = 2. Second, take the coin-tossing experiment of the chapter on probability. In one experiment, 52 heads turned up in 100 tosses. The calculations are the same as those just given. Since there were 52 heads, the deviation from the mean, or expected frequency, is 52 - 50 = 2. In standard deviation terms or units, this is 2/5 = .4 standard deviation units from the mean. We now get back to one of the original questions we asked: are these differences "statistically significant"? We found, via χ², that the result of 60 Agrees was statistically significant and that the result of 52 heads was not statistically significant. Can we do the same thing with the present formula? Yes, we can. Further, the beauty of the present method is that it is applicable to all kinds of numbers, not just to binomial numbers. Before demonstrating this, however, we must study, if only briefly, the so-called law of large numbers and the properties of the standard deviation and the normal probability curve.
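In code, the two applications come down to the same three formulas. A minimal sketch (Python, illustrative):

```python
import math

def binomial_z(observed, n, p=0.5):
    """Deviation of an observed count from np, in units of sqrt(npq)."""
    mean = n * p                      # Equation 11.3
    sd = math.sqrt(n * p * (1 - p))   # Equation 11.8: square root of npq
    return (observed - mean) / sd

print(binomial_z(60, 100))  # 2.0 standard deviations: the 60 Agrees
print(binomial_z(52, 100))  # 0.4 standard deviations: the 52 heads
```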
THE LAW OF LARGE NUMBERS

The law of large numbers says, in effect, that as you increase the size of samples, you decrease the probability that the observed value of an event, A, will deviate from the "true" value of A by more than a fixed amount, k.¹² Provided the members of the samples are drawn independently, the larger the sample the closer the "true" value of the population is approached. The law is also a gateway to the testing of statistical hypotheses, as we shall see.
Toss a coin 10, 50, 100, 400, and 1,000 times. Let heads be the outcome in which we are interested. We calculate means, variances, standard deviations, and two new measures. The first of these new measures is the proportion of favorable outcomes, heads in this case, in the total sample. We call this measure H_n and define it as H_n = S_n/n. (Recall that S_n is the total number of times the favorable outcome occurs in n trials.) Then the fraction of the time that the favorable outcome occurs is H_n. The mean of H_n is p, or M(H_n) = p. [This follows from Equation 11.3, where M(S_n) = pn; since H_n = S_n/n, M(H_n) = M(S_n)/n = np/n = p.] In short, M(H_n) equals the expected probability. The second measure is the variance of H_n. It is defined: V(H_n) = pq/n. The variance, V(H_n), is a measure of the variability of the mean, M(H_n). Later, more will be said about the square root of V(H_n), called the standard error of the mean. The results of the calculations are given in Table 11.1.

¹²A brief statement of the law by Bernoulli himself can be found in: J. Newman, The World of Mathematics, vol. 3. New York: Simon and Schuster, 1956, pp. 1452-1455. For more exact statements than are possible in this text, see ibid., pp. 1448-1449.
TABLE 11.1  Means, Variances, Standard Deviations, and Expected Probabilities of the Outcome Heads with Different Size Samples (n = 10, 50, 100, 400, 1000)
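The entries of Table 11.1 follow directly from the formulas just given; a sketch like the following (Python, illustrative; the printed layout is approximate, not the book's own table) regenerates them for p = q = 1/2:

```python
import math

p = q = 0.5
print("n     M(Sn)  V(Sn)  SD(Sn)   M(Hn)  V(Hn)     SD(Hn)")
for n in (10, 50, 100, 400, 1000):
    m_s, v_s = n * p, n * p * q   # Equations 11.3 and 11.7
    m_h, v_h = p, p * q / n       # mean and variance of Hn
    print(f"{n:<5} {m_s:<6} {v_s:<6} {math.sqrt(v_s):<8.3f} {m_h:<6} {v_h:<9.6f} {math.sqrt(v_h):.4f}")
```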
the curve are conceived as probabilities and interpreted as such. If the total area under the whole curve is equal to 1.00, then if a vertical line is drawn upward from the base line at the mean (z = 0) to the top of the bell, the areas to the left and to the right of the vertical line are each equal to 1/2, or 50 percent. But vertical lines might be drawn elsewhere on the base line, at one standard deviation above the mean (z = +1) or two standard deviations below the mean (z = -2). To interpret such points in area terms, and in probability terms, we must know the area properties of the curve.
The approximate percentages of the areas one, two, and three standard deviations above and below the mean have been indicated in Figure 11.1. For our purposes, it is not necessary to use the exact percentages. The area between z = -1 and z = +1 is approximately 68 percent of the total area; that is, the probability that the number of heads falls between one standard deviation below the mean (z = -1) and one standard deviation above the mean (z = +1) is approximately .68. Roughly, then, there are two out of three chances that the number of heads will be between 45 and 55 (50 ± 5). There is one chance in three, approximately, that the number of heads will be less than 45 or greater than 55. That is, q = 1 - p = 1 - .68 = .32, approximately.
Take two standard deviations above and below the mean. These points would be 50 - (2)(5) = 40 and 50 + (2)(5) = 60. Since we know that about 95 or 96 percent of the cases will probably fall into this band, that is, between z = -2 and z = +2, or between 40 and 60, we can say that the probability that the number of heads will not be less than 40 nor greater than 60 is about .95 or .96. That is, there are only about 4 or 5 chances in 100 that less than 40, or more than 60, heads will occur. It can happen. But it is unlikely to happen.
If we want or need to be practically certain (as in certain kinds of medical or engineering research), then we can go out to three standard deviations, z = -3 and z = +3, or perhaps somewhat less than three standard deviations. (The .01 level is about two and a half standard deviations.) Three standard deviations means numbers of heads between 35 and 65. Since three standard deviations above and below the mean in Figure 11.1 take up more than 99 percent of the area of the curve, we can say that we are practically certain that the number of heads in 100 tosses of a fair coin will not be less than 35 nor more than 65. The probability is greater than .99. If you tossed a coin 100 times and got, say, 68 heads, you might conclude that there was probably something wrong with the coin. Of course, 68 heads can occur, but it is very, very unlikely that they will.
The earlier Agree-Disagree problem is treated exactly the same as the coin problem above. The result of 60 Agrees and 40 Disagrees is unlikely to happen. There are only about 4 chances in 100 that 60 Agrees and 40 Disagrees will happen by chance. We knew this before from the χ² test and from the exact probability test. Now we have a third way that is generally applicable to all kinds of data, provided the data are distributed normally or approximately so.
a gigantic distribution of means (and standard deviations). What will this distribution be like? First, it will form a beautiful bell-shaped normal curve. Means do. They have the property of falling nicely into the normal distribution, even when the original distributions from which they are calculated are not normal. This is because we assumed "other things equal" and thus have no source of mean fluctuations other than chance. The means will fluctuate, but the fluctuations will all be chance fluctuations. Most of these fluctuations will cluster around what we will call the "true" mean, the "true" value of the gigantic population of means. A few will be extreme values. If we repeated the 100 coin-tosses experiment many, many times, we would find that heads would cluster around what we know is the "true" value: 50. Some would be a little higher, some a little lower, a few considerably higher, a few considerably lower. In brief, the heads and the means will obey the same "law." Since we assumed that nothing else is operating, we must come to the conclusion that these fluctuations are due to chance. And chance errors, given enough of them, distribute themselves into a normal distribution. This is the theory. It is called the theory of errors.
Continuing our story of the mean, if we had the data from the very many administrations of the mathematics test to the same group, we could calculate a mean and a standard deviation. The mean so calculated would be close to the "true" mean. If we had an infinite number of means from an infinite number of test administrations and calculated the mean of the means, we would then obtain the "true" mean. Similarly for the standard deviation of the means. Naturally, we cannot do this, for we do not have an infinite or even a very large number of test administrations. There is fortunately a simple way to solve the problem. It consists in accepting the mean calculated from the sample as the "true" mean and then estimating how accurate this acceptance (or assumption) is. To do this, a statistic known as the standard error of the mean is calculated. It is defined:

SE_M = σ_pop / √n     (11.9)

where SE_M is the standard error of the mean; σ_pop, the standard deviation of the population (σ is read "sigma"); and n, the number of cases in the sample.
There is a little snag here. We do not know, nor can we know, the standard deviation of the population. Recall that we also did not know the mean of the population, but that we estimated it with the mean of the sample. Similarly, we estimate the standard deviation of the population with the standard deviation of the sample. Thus the formula to use is

SE_M = SD / √n     (11.10)

The mathematics test mean can now be studied for its reliability. We calculate:

SE_M = 10 / √100 = 10/10 = 1
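In code, the reliability check is brief. A sketch (Python, illustrative), using the values assumed in the text's example (mean 70, SD 10, n = 100):

```python
import math

mean, sd, n = 70.0, 10.0, 100
se = sd / math.sqrt(n)               # Formula 11.10: 10 / 10 = 1
print(se)                            # 1.0
print(mean - 2 * se, mean + 2 * se)  # 68.0 to 72.0: the approximate .95 interval
```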
Again imagine a large population of means of this test. If they are put into a distribution and the curve of the distribution plotted, the curve would look something like the curve of Figure 11.2. Keep firmly in mind: this is an imaginary distribution of means of samples. It is not a distribution of scores. It is easy to see that the means of this distribution are not very variable. If we double the standard error of the mean we get 2. Subtract this from and add it to the mean of 70: 68 to 72. The probability is approximately .95 that the population ("true") mean lies within the interval 68 to 72; that is, approximately 5 percent of the time the means of random samples of this size will lie outside this interval.
If we do the same calculation for the intelligence test data of Figure 11.1, we obtain

SE_M = 16 / √400 = 16/20 = .80

Figure 11.2
Three standard errors above and below the mean of 100 give the range 97.60 to 102.40, or we can say that the "true" mean very probably (less than 1 percent chance of being wrong) lies within the interval 98 to 102. Means are reliable, with fair-size samples.¹³
The standard error of the mean, then, is a standard deviation. It is a standard deviation of an infinite number of means. Only chance error makes the means fluctuate. Thus the standard error of the mean, or the standard deviation of the means, if you like, is a measure of chance or error in its effect on one measure of central tendency.
A caution is in order. All of the theory discussed is based on the assumptions of random sampling and independence of observations. If these assumptions are violated, the reasoning, while not entirely invalidated, practically speaking, is open to question. Estimates of error may be biased to a greater or lesser extent. The trouble is we cannot tell how much a standard error is biased. Years ago Guilford gave interesting examples of the biases encountered when the assumptions are violated.¹⁴ With large numbers of Air Force pilots, he found that estimates of standard errors were sometimes considerably off. No one can give hard-and-fast rules. The best maxim probably is: use random sampling and keep observations independent, if at all possible.
If random sampling cannot be used, and if there is doubt about the independence of observations, calculate the statistics and interpret them. But be circumspect about interpretations and conclusions; they may be in error. Because of such possibilities of error, it has been said that statistics are misleading, and even useless. Like any other method (consulting authority, using intuition, and the like), statistics can be misleading. But even when statistical measures are biased, they are usually less biased than authoritative and intuitive judgments. It is not that numbers lie. The numbers do not know what they are doing. It is that the human beings using the numbers may be informed or misinformed, biased or unbiased, knowledgeable or ignorant, intelligent or stupid. Treat numbers and statistics neither with too great respect nor too great contempt. Calculate statistics and act as though they were "true," but always maintain a certain reserve toward them, a willingness to disbelieve them if the evidence indicates such disbelief.¹⁵
¹³Even with relatively small samples, the mean is quite stable. (See the intelligence test data in Chapter 8.) Five samples of 20 intelligence scores each were drawn from a population of such scores with a mean of 95. The means of the five samples were calculated. Standard errors of the mean were calculated for the first two samples, and interpretations were made. Then comparisons were made to the "true" value of 95. The mean of the first sample was 93.55, with a standard deviation of 12.22; SE_M = 2.73. The .05 level range of means was 88.09 to 99.01. Obviously 95 falls within this range. The mean of the second sample was more deviant: 90.20. The standard deviation was 9.44; SE_M = 2.11. The .05 level range was 85.98 to 94.42. Our 95 does not fall in this range. The .01 level range is 83.87 to 96.53. Now 95 is encompassed. This is not bad at all for samples of only 20. For samples of 50 or 100 it would be even better. The mean of the five means was 93.31; the standard deviation of these means was 2.73. Compare this to the standard errors calculated from the two samples: 2.73 and 2.11. In the next chapter a more convincing demonstration of the stability of means will be given.
"J. Guilford. Fundamental Psychology and Education. 3d ed. New York; McGraw-Hill, 1956.
Statistics in
pp. 169-173.
¹⁵Study suggestions for this chapter are given at the end of the next chapter.
Chapter 12

Testing Hypotheses and the Standard Error
The standard error, as an estimate of chance fluctuation, is the measure against which
the outcomes of experiments are checked. Is there a difference between the means of two
experimental groups? If so, is the difference a "real" difference or merely a consequence
of the many relatively small differences that could have arisen by chance? To answer this
question, the standard error of the differences between means is calculated and the ob-
tained difference is compared to this standard error. If it is sufficiently greater than the
standard error, it is said to be a "significant" difference. Similar reasoning can be applied
to any statistic. Thus, there are many standard errors: of correlation coefficients, of differ-
ences between means, of means, of medians, of proportions, and so on. The purpose of
this chapter is to examine the general notion of the standard error and to see how hypothe-
ses are tested using the standard error.
situationally taste their food before salting it.¹ They further reasoned that the former individuals would ascribe more traits to themselves than the latter individuals. They found that the former group, the "salters," ascribed a mean of 14.87 traits to themselves, whereas the latter group, the "tasters," ascribed a mean of 6.90 traits to themselves. The direction of the difference was as the authors predicted. Is the size of the difference between the means, 7.97, sufficient to warrant the authors' claim that their hypothesis was supported? A test of the statistical significance of this difference showed that it was highly significant.
The point of this example in the present context is that the difference between means
was tested for statistical significance with a standard error. The standard error in this case
was the standard error of the difference between two means. The difference was found to
be significant. This means that those individuals who perceive behavior as influenced by
individual traits tend to salt their food before tasting it, whereas those individuals whose
perception is more environmentally oriented taste their food before salting it. (This state-
ment is a generalization of the original.) Now let us look at an example in which the
difference between means was not significant.
Gates and Taylor, in a well-known early study of transfer of training, set up two matched groups of 16 pupils each.² The experimental group was given practice in digit memory; the control group was not. The mean gain of the experimental group right after the practice period was 2.00; the mean gain of the control group was .67, a mean difference of 1.33. Four to five months later, the children of both groups were tested again. The mean score of the experimental group was 4.71; the mean score of the control group was surprising: 4.77. The mean gains over the initial tests were .35 and .36. Statistical tests are hardly necessary with data like these.
2.75, a difference of .11, which was statistically significant. Is such a small difference meaningful? Contrast this small difference to one of the mean differences between experimental and control groups obtained by Mann and Janis in their study of the long-term effects of role playing on smoking: 13.50 and 5.20. (These are mean decreases in number of cigarettes smoked.)⁴
¹M. McGee and M. Snyder, "Attribution and Behavior: Two Field Studies," Journal of Personality and Social Psychology, 32 (1975), 185-190.
²A. Gates and G. Taylor, "An Experimental Study of the Nature of Improvement Resulting from Practice in a Mental Function," Journal of Educational Psychology, 16 (1925), 583-592.
³P. Goldberg, M. Gottesdiener, and P. Abramson, "Another Put-Down of Women?: Perceived Attractiveness as a Function of Support for the Feminist Movement," Journal of Personality and Social Psychology, 32 (1975), 113-115.
""L. Mann and I. Janis, "A FoUow-Up on the Long-Term Effects of Emotional Role Playing."' Journal of
ences and one of practical or "real" significance versus statistical significance. What appears to be a very small difference may, upon close examination, not be so small. In the Goldberg et al. study, to be sure, the difference of .11 is probably trivial even though statistically significant. The .11 was derived from a five-point scale of attractiveness, and is thus really small. Now take an entirely different sort of example from an important study by Miller and DiCara on the instrumental conditioning of urine secretion.⁵ The means of a group of rats before and after training to secrete urine were .017 and .028, and the difference was highly statistically significant. But the difference was only .011. Is this not much too small to warrant serious consideration? But now the nature of the measures has to be considered. The small means of .017 and .028 were obtained from measures of urine secretion of rats. When one considers the size of rats' bladders and that the mean difference of .011 was produced by instrumental conditioning (reward for secreting urine), the meaning of the difference is dramatic: it is even quite large! (We will analyze the data in a later chapter and perhaps see this more clearly.)
One should ordinarily not be enthusiastic about mean differences like .20, .15, .08, and so on, but one has to be intelligent about it. Suppose that a very small difference is reported as statistically significant, and you think this ridiculous. But also suppose that it was the mean difference between the cerebral cortex weights of groups of rats under enriched and deprived experiences in the early days of their lives.⁶ To obtain any difference in brain weight due to experience is an outstanding achievement and, of course, an important scientific discovery.
CORRELATION COEFFICIENTS

Correlation coefficients are reported in large quantities in research journals. Questions as to the significance of the coefficients, and the "reality" of the relations they express, must be asked. For example, to be statistically significant, a coefficient of correlation calculated between 30 pairs of measures has to be approximately .31 at the .05 level and .42 at the .01 level. With 100 pairs of measures the problem is less acute (the law of large numbers again). To carry the .05 day, an r of .16 is sufficient; to carry the .01 day, an r of about .23 does it. If r's are less than these values, they are considered to be not significant.
If one draws, say, 30 pairs of numbers from a table of random numbers and correlates them, theoretically the r should be near zero. Clearly, there should be near-zero relations between sets of random numbers, but occasionally sets of pairs can yield statistically significant and reasonably high r's by chance. At any rate, coefficients of correlation, as well as means and differences, have to be weighed in the balance for statistical significance by stacking them up against their standard errors. Fortunately, this is easy to do, since r's for different levels of significance and for different sizes of samples are given in tables in most statistics texts. Thus, with r's it is not necessary to calculate and use the standard error of an r. The reasoning behind the tables has to be understood, however.⁷
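The point about chance r's can be checked by simulation. The sketch below (Python; statistics.correlation requires Python 3.10 or later) correlates many sets of 30 random pairs and counts how often r reaches the .31 value cited above; by the meaning of the .05 level, this should happen about 5 percent of the time in the predicted direction.

```python
import random
import statistics

random.seed(2)
trials = 2000
significant = 0
for _ in range(trials):
    x = [random.random() for _ in range(30)]
    y = [random.random() for _ in range(30)]
    r = statistics.correlation(x, y)  # Pearson r between random pairs
    if r >= 0.31:                     # the .05 value cited for 30 pairs
        significant += 1
print(significant / trials)           # roughly .05, by chance alone
```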
⁵N. Miller and L. DiCara, "Instrumental Learning of Urine Formation by Rats: Changes in Renal Flow," American Journal of Physiology, 215 (1968), 677-683.
⁶E. Bennett et al., "Chemical and Anatomical Plasticity of Brain," Science, 146 (1964), 610-619. (See, especially, remarks on p. 618.)
⁷There has been a good deal of misunderstanding about the assumptions that have to be satisfied to calculate coefficients of correlation. No assumptions have to be satisfied simply to calculate r's. The assumptions come in only when we wish to infer from the sample to the population. See W. Hays, Statistics, 3d ed. New York: Holt, Rinehart and Winston, 1981, pp. 459-461.
Of the thousands of correlation coefficients reported in the research literature, many are of low magnitude. How low is low? At what point is a correlation coefficient too low to warrant treating it seriously? Usually, r's less than .10 cannot be taken too seriously: an r of .10 means that only one percent (.10²) of the variance of y is shared with x! If an r of .30, on the other hand, is statistically significant, it may be important because it may point to an important relation. r's between .20 and .30 make the problem more difficult. (And remember that with large N's, r's between .20 and .30 are statistically significant.) To be sure, an r, say, of .20 means that the two variables share only four percent of their variance. But an r of .26 (seven percent of the variance shared), or even one of .20, may be important because it may provide a valuable lead for theory and subsequent research. The problem is complex. In basic research, low correlations (of course, they should be statistically significant) may enrich theory and research. It is in applied research, where prediction is important, that value judgments about low correlations and the trivial amounts of variance shared have grown. In basic research, however, the picture is more complicated. One conclusion is fairly sure: correlation coefficients, like other statistics, must be tested for statistical significance.
A statement in which a relation between variables is expressed, for example, "The greater the cohesiveness of a group the greater its influence on its members,"⁸ is a substantive hypothesis. An investigator's theory dictates that this variable is related to that variable. The statement of the relation is a substantive hypothesis.
A substantive hypothesis itself, strictly speaking, is not testable. It has first to be translated into operational terms. One very useful way to test substantive hypotheses is through the null hypothesis, a statement of no relation between the variables (of the problem). The null hypothesis says, "You're wrong, there is no relation; disprove me if you can." It says this in statistical terms, such as M_A = M_B, or M_A - M_B = 0; r = 0; χ² is not significant; t is not significant; and so on.⁹

⁸S. Schachter et al., "An Experimental Study of Cohesiveness and Productivity," Human Relations, 4 (1951), 229-238.
Fisher says, "Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis."¹⁰ Aptly said. What does it mean? Suppose you entertain a hypothesis to the effect that method A is superior to method B. If you satisfactorily solve the problems of defining what you mean by "superior," of setting up an experiment, and the like, you now must specify a statistical hypothesis. In this case, you might say M_A > M_B (the mean of method A is, or will be, greater than the mean of method B on such-and-such a criterion measure). Assume that after the experiment the two means are 68 and 61, respectively. It would seem that your substantive hypothesis is upheld, since 68 > 61, or M_A is greater than M_B. As we have already learned, however, this is not enough, since this difference may be one of the many possible similar differences due to chance.
In effect, we set up what can be called the chance hypothesis: M_A = M_B, or M_A - M_B = 0. These are null hypotheses. What we do, then, is write hypotheses. First we write the statistical hypothesis that reflects the operational-experimental meaning of the substantive hypothesis. Then we write the null hypothesis against which we test the first type of hypothesis. Here are the two kinds of hypothesis, suitably labeled:

H_1: M_A > M_B
H_0: M_A = M_B

H_1 means "Hypothesis 1." There is often more than one such hypothesis. They are labeled H_1, H_2, H_3, and so on. H_0 means "null hypothesis." Note that the null hypothesis could in this case have been written:

H_0: M_A - M_B = 0

This form shows where the null hypothesis got its name: the difference between M_A and M_B is zero. But it is a little unwieldy in this form, especially when there are three or more means or other statistics being tested. M_A = M_B is general, and of course means the same as M_A - M_B = 0 and M_B - M_A = 0. Notice that we can write quite easily M_A = M_B = M_C = ··· = M_N.
⁹Researchers sometimes unwittingly use null hypotheses as substantive hypotheses. Instead of saying that one method of presenting textual materials has a greater effect on recall memory than another method, for instance, they may say that there is no difference between the two methods. This is poor practice because it in effect uses the statistical null hypothesis as a substantive hypothesis and thus confuses the two kinds of hypotheses. Strictly speaking, any significant result, positive or negative, then supports the hypothesis. But this is certainly not the intention. The intention is to bring statistical evidence to bear on the substantive hypothesis, for example, on M_A > M_B. If the result is statistically significant, that is, M_A ≠ M_B, or the null hypothesis is rejected and M_A > M_B, then the substantive hypothesis is supported. Using the null hypothesis substantively loses the power of the substantive hypothesis, which amounts to the investigator making a specific nonchance prediction.
There is of course always the rather rare possibility that a null hypothesis is the substantive hypothesis. If, for example, an investigator seeks to show that two methods of teaching make no difference in achievement, then the null hypothesis is presumably appropriate. The trouble now is that the investigator is in a difficult logical position because it is extremely difficult, perhaps impossible, to demonstrate the empirical "validity" of a null hypothesis. After all, if the hypothesis M_A = M_B is supported, it could well be one of the many chance results that are possible rather than a meaningful nondifference! Good discussions of hypothesis testing are given in: R. Giere, Understanding Scientific Reasoning. New York: Holt, Rinehart and Winston, 1979, chaps. 6, 8, 11, and 12, especially chap. 11.
'"R. Fisher. The Design of Experiments, 6th ed. New York: Hafner. 1951. p. 16.
If this were the best of all possible research worlds, there would be no random error. And if there were no random error, there would be no need for statistical tests of significance. The word "significance" would be meaningless, in fact. Any difference at all would be a "real" difference. But such is never the case, alas. There are always chance errors (and biased errors, too), and in behavioral research they often contribute substantially to the total variance. Standard errors are measures of this error and are used, as has repeatedly been said, as a sort of yardstick against which experimental or "variable" variance is checked.
The standard error is the standard deviation of the sampling distribution of any given measure, the mean or the correlation coefficient, for instance. In most cases, population or universe values (parameters) cannot be known; they must be estimated from sample measures, usually from single samples.
Suppose we draw a random sample of 100 children from eighth-grade classes in such-and-such a school system. It is difficult or impossible, say, to measure the whole universe of eighth-grade children. We calculate the mean and the standard deviation from a test we give the children and find these statistics to be M = 110, SD = 10. An important question we must ask ourselves is "How accurate is this mean?" Or, if we were to draw a large number of random samples of 100 eighth-grade pupils from this same population, will the means of these samples be 110 or near 110? And, if they are near 110, how near? If we had the whole population of means, or we knew what it was, everything would be simple. But we do not know this value, and we are not able to know it, since the possibilities of drawing different samples are so numerous. The best we can do is to estimate it with our sample value, or sample mean. We simply say, in this case, let the sample mean equal the mean of the population, and hope we are right. Then we must test our equation. This we do with the standard error.
A similar argument applies to the standard deviation of the whole population (of the
original scores). We do not know and probably can never know it. But we can estimate it
with the standard deviation calculated from our sample. Again, we say, in effect, let the
standard deviation of the sample equal the standard deviation of the population. We know
they are probably not the same value, but we also know, if the sampling has been random,
that they are probably close.
In Chapter 11 the sample standard deviation was used as a substitute for the standard deviation of the population in the formula for the standard error of the mean:

SE_M = SD / √n     (12.1)

This is also called the sampling error. Just as the standard deviation is a measure of the dispersion of the original scores, the standard error of the mean is a measure of the dispersion of the distribution of sample means. It is not the standard deviation of the population of individual scores that we would obtain if, for example, we could test every member of the population and calculate the mean and standard deviation of this population.
The Procedure

A computer program was written to generate 4000 random numbers evenly distributed between 0 and 100 (so that each number has an equal chance of being "drawn") in 40 sets of 100 numbers each, and to calculate various statistics with the numbers. Consider this set of 4000 numbers a population, or U. The mean of U is 50.33 (by actual computer calculation) and the standard deviation 29.17. We wish to estimate this mean from samples that we draw randomly from U. Of course, in a real situation we would usually not know the mean of the population. One of the virtues of Monte Carlo procedures is that we can know what we ordinarily do not know.

Five of the 40 sets of 100 numbers were drawn at random. (The sets drawn were 5, 7, 8, 16, and 36. See Appendix C.) The means and standard deviations of the five sets were calculated. So were the five standard errors of the mean. These statistics are reported in Table 12.1. We want to give an intuitive notion of what the standard error of the mean is, and then we want to show how it is used.
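Readers with any computer at hand can re-create the procedure. Here is a minimal sketch (Python, illustrative; the original program's language and random-number generator are not specified, so exact values will differ):

```python
import math
import random
import statistics

random.seed(0)  # any seed; the particular values will vary
population = [random.uniform(0, 100) for _ in range(4000)]
print(statistics.mean(population))   # near 50, the theoretical mean

for _ in range(5):                   # five random samples of 100
    sample = random.sample(population, 100)
    m = statistics.mean(sample)
    sd = statistics.pstdev(sample)
    se = sd / math.sqrt(len(sample)) # standard error of the mean
    print(round(m, 2), round(sd, 2), round(se, 2))
```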
Table 12.1  Means, Standard Deviations, and Standard Errors of the Mean, Five Samples of 100 Random Numbers (0 through 100)
49.02, are rather close to the population mean, and two of them, 53.21 and 55.51, are farther away from it. So it seems that three of the samples provide good estimates of the population mean and two do not. Or do they?
The standard deviation of 2.38 is akin to the standard error of the mean. (It is, of course, not the standard error of the mean, because it has been calculated from only five means.) Suppose only one sample, the first, with M = 53.21 and SD = 29.62, had been drawn (and this is the usual situation in research) and the standard error of the mean calculated:

SE_M = SD / √n = 29.62 / √100 = 2.96

This value is an estimate of the standard deviation of the population means of many, many samples of 100 cases, each randomly drawn from the population. Our population has 40 groups and thus 40 means. (Of course, this is not many, many means.) The standard deviation of these means is actually 3.10. The SE_M calculated with the first sample, then, is close to this population value: 2.96 as an estimate of 3.10.
The five standard errors of the mean are given in the third data line of Table 12.1. They fluctuate very little (from 2.67 to 2.98), even though the means of the sets of 100 scores vary considerably. The standard deviation of 2.38 calculated from the five means is only a fair estimate of the standard deviation of the population of means. Yet it is an estimate. The interesting and important point is that the standard error of the mean, which is a "theoretical" estimate calculated from the data of any one of the five groups, is an accurate estimate of the variability of the means of samples of the population.
To reinforce these ideas, let's now look at another Monte Carlo demonstration of much greater magnitude. The computer program used to produce the 4000 random numbers for the example discussed above was used to produce 19 more sets of 4000 random numbers each, evenly distributed between 0 and 100. That is, a total of 80,000 random numbers, in 20 sets of 4000 each, were generated. The theoretical mean, again, of numbers between 0 and 100 is 50. Consider each of the 20 sets as a sample of 4000 numbers. The means of the 20 sets are given in Table 12.2.
wrong using it to assess the variability of the means of samples of 4000 random numbers. Clearly, means of large samples are highly stable statistics, and standard errors are good estimates of their variability.
Generalizations
Three or four generalizations of great usefulness in research can now be made. One,
means of samples are stable in the sense that they are much less variable than the measures
from which they are calculated. This is, of course, true by definition. Variances, standard
deviations, and standard errors of the mean are even more stable; they fluctuate within
relatively narrow ranges. Even when the sample means of our example varied by as much
as four or five points, the standard errors fluctuated by no more than a point and a half.
This means that we can have considerable faith that estimates of sample means will be
rather close to the mean of a population of such means. And the law of large numbers tells
us that the larger the sample size, the closer to the population values the statistics will
probably be.
A difficult question for researchers is: Do these generalizations always hold, espe-
cially with nonrandom samples? The validity of the generalizations depends on random
sampling. If the sampling is not random, we cannot really know whether the generaliza-
tions hold. Nevertheless, we often have to act as though they do hold, even with nonran-
dom samples. Fortunately, if we are careful about studying our data to detect substantial
sample idiosyncrasy, we can use the theory profitably. For example, samples can be
checked for easily verified expectations. If one expects about equal numbers of males and
females in a sample, or known proportions of young and old or Republican and Democrat,
it is simple to count these numbers. There are experts who insist on random sampling as
Before studying the actual use of the standard error of the mean, we should look, if briefly, at an extremely important generalization about means: if samples are drawn from a population at random, the means of the samples will tend to be normally distributed. The larger the n's, the more this is so. And the shape and kind of distribution of the original population makes no difference. That is, the population distribution does not have to be normally distributed.¹¹
For example, the distribution of the 4000 random numbers in Appendix C is rectangular, since the numbers are evenly distributed. If the central limit theorem is empirically valid, then the means of the 40 sets of 100 scores each should be approximately normally distributed. If so, this is a remarkable thing. And it is so, though one sample of 40 means is hardly sufficient to show the trend too well. Therefore, three more populations of 4000 different evenly distributed random numbers, each partitioned into 40 subsets of 100 numbers, were generated on the computer. The means of the 4 × 40 = 160 subsets of 100 numbers each were calculated and put into one distribution. A frequency polygon of the means is given in Figure 12.1. It can be seen that the 160 means look almost like the bell-shaped normal curve. Apparently the central limit theorem "works." And bear in mind that this distribution of means was obtained from rectangular distributions of numbers.

¹¹See Hays, op. cit., pp. 215-220. Hays gives a neat example to show how the theorem works (pp. 219-220). Another good discussion with examples is given by G. Snedecor and W. Cochran, Statistical Methods, 6th ed. Ames, Iowa: Iowa State University Press, 1967, pp. 51-56.
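The same demonstration is easy to repeat. This sketch (Python, illustrative) draws 160 samples of 100 evenly distributed numbers and prints a crude frequency polygon of their means; the pile-up in the middle is the bell taking shape.

```python
import random
import statistics

random.seed(1)
means = [statistics.mean(random.uniform(0, 100) for _ in range(100))
         for _ in range(160)]

# Crude frequency polygon: counts of means in two-point intervals
for lo in range(40, 60, 2):
    band = sum(lo <= m < lo + 2 for m in means)
    print(f"{lo:2d}-{lo + 2:<2d} {'*' * band}")
```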
Why go to all this bother? Why is it important to show that distributions of means
approximate normality? We work with means a great deal in data analysis, and if they are
normally distributed then one can use the known properties of the normal curve to inter-
pret obtained research data. Knowing that approximately 96 percent of the means will lie
between two standard deviations (standard errors) above and below the mean is valuable
information, because an obtained result can be assessed against the known properties of
the normal curve. In the last chapter we saw the use of the normal curve in interpreting
means. We now turn to what is perhaps a more interesting use of the curve in assessing the
differences between means.
One of the most frequent and useful strategies in research is to compare means of samples. From differences in means we infer effects of independent variables. Any linear combination of means is also governed by the central limit theorem. That is, differences in means will be normally distributed, given large enough samples. (A linear combination is any equation of the first degree, e.g., Y = M_1 - M_2; Y = M_1² - M_2 is not linear.) Therefore we can use the same theory with differences between means that we use with means.
Suppose we have randomly assigned 200 subjects to two groups, 100 to each group. We show a movie on intergroup relations to one group, for example, and none to the other group. Next, we give both groups an attitude measure. The mean score of Group A (saw the movie) is 110, and the mean score of Group B (did not see the movie) is 100. Our question is: if we draw other samples, calculate the differences between the means of these samples, and go through the same experimental procedure, will we consistently get this difference of 10? Again, we use the standard error to evaluate our differences, but this time we have a sampling distribution of differences between means. It is as if we took each M_1 - M_2 and considered it as an X. Then the several differences between the means of the samples are considered as the X's of a new distribution. At any rate, the standard deviation of this sampling distribution of differences is akin to the standard error. But this procedure is only for illustration; actually we do not do this. Here, again, we estimate the standard error from our first two groups, A and B, by using the formula:
SE_(MA-MB) = √(SE_MA² + SE_MB²)     (12.2)

where SE_MA² and SE_MB² are the standard errors squared, respectively, of Groups A and B, as previously stated.¹²
Suppose we did the experiment with five double groups, that is, ten groups, two at a time. The five differences between the means were 10, 11, 12, 8, 9. The mean of these differences is 10; the standard deviation is 1.414. This 1.414 is again akin to the standard error of the sampling distribution of the differences between the means, in the same sense as the standard error of the mean in the earlier discussion. Now, if we calculate the standard error of the mean for each group (making up standard deviations for the two groups, SD_A = 8 and SD_B = 9), we obtain:

SE_MA = SD_A / √n_A = 8 / √100 = .8
SE_MB = SD_B / √n_B = 9 / √100 = .9

By Equation 12.2 we calculate the standard error of the differences between the means:

SE_(MA-MB) = √(.8² + .9²) = √1.45 = 1.20
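The arithmetic of Equation 12.2 in code (Python, illustrative), using the made-up standard deviations above:

```python
import math

sd_a, sd_b, n = 8.0, 9.0, 100
se_a = sd_a / math.sqrt(n)              # .8
se_b = sd_b / math.sqrt(n)              # .9
se_diff = math.sqrt(se_a**2 + se_b**2)  # sqrt(1.45) = 1.20
print(round(se_diff, 2))                # 1.2
print(round(10 / se_diff, 2))           # 8.3; the text rounds the SE to 1.20
                                        # and reports 10/1.20 = 8.33
```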
What do we do with the 1.20 now that we have it? If the scores of the two groups had been chosen from a table of random numbers and there were no experimental conditions, we would expect no difference between the means. But we have learned that there are always relatively small differences due to chance factors. These differences are random. The standard error of the differences between the means is an estimate of the dispersion of these differences. But it is a measure of these differences that is an estimate for the whole population of such differences. For instance, the standard error of the differences between the means is 1.20. This means that, by chance alone, around the difference of 10 between M_A and M_B there will be random fluctuations: now 10, now 10.2, now 9.8, and so on. Only rarely will the differences exceed, say, 13 or 7 (about three times the SE). Another way of putting it is to say that the standard error of 1.20 indicates the limits (if we multiply the 1.20 by the appropriate factor) beyond which sample differences between the means probably will not go.
What has all this to do with our experiment? It is precisely here that we evaluate the experimental results. The standard error of 1.20 estimates random fluctuations. Now, M_A − M_B = 10. Could this have arisen by chance, as a result of random fluctuations as just described? It should by now be halfway clear that this cannot be, except under very unusual circumstances. We evaluate this difference of 10 by comparing it to our estimate of chance fluctuations, forming a t ratio: t = 10/1.20 = 8.33.

¹³Other formulas are applicable under other circumstances, for example, if we start off with matched pairs of subjects.
This means that our measured difference between M_A and M_B would be 8.33 standard deviations away from a hypothesized mean of zero (zero difference, no difference between the two means).
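Because this arithmetic is the backbone of much that follows, it may help to see it carried out mechanically. A minimal sketch in Python (ours, not the book's), reproducing the figures of the example:

    import math

    # Figures from the example: SDs of 8 and 9, n = 100 in each group,
    # and an obtained difference between means of 10.
    se_a = 8 / math.sqrt(100)              # standard error of the mean, Group A: .80
    se_b = 9 / math.sqrt(100)              # Group B: .90
    se_d = math.sqrt(se_a**2 + se_b**2)    # Equation 12.2: sqrt(.64 + .81) = 1.2042
    t = 10 / se_d                          # 8.30; the text rounds SE_D to 1.20, giving 8.33
    print(round(se_d, 2), round(t, 2))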
We would not have any difference, theoretically, if our subjects were well randomized
and there had been no experimental manipulation. We would have, in effect, two distribu-
tions of random numbers from which we could expect only chance fluctuations. But here
we have, comparatively, a huge difference of 10, compared to an insignificant 1.20 (our
estimate of random deviations). Decidedly, something is happening here besides chance.
And this something is just what we are looking for. It is, presumably, the effect of the
movie, or the effect of the experimental condition, other conditions having been suffi-
ciently controlled, of course.
Look at Figure 12.2. It represents a population of differences between means with a
mean of zero and a standard deviation of 1.20. (The mean is set at zero, because we
assume that the mean of all the mean differences is zero.) Where would the difference of
10 be placed on the base line of the diagram? In order to answer this question, the 10 must
first be converted into standard deviation (or standard error) units. (Recall standard scores
from the last chapter.) This is done by dividing by the standard deviation (standard error),
which is 1.20: 10/1.2 = 8.33. But this is what we got when we calculated the t ratio. It is, then, simply the difference between M_A and M_B, 10, expressed in standard deviation (standard error) units. Now we can put it on the base line of the diagram. Look far to the right for the dot. Clearly the difference of 10 is a deviate. It is so far out, in fact, that it probably does not belong to the population in question. In short, the difference between M_A and M_B is statistically significant, so significant that it amounts to what Bernoulli called "moral certainty." Such a large difference, or deviation from chance expectation,
can hardly be attributed to chance. The odds are actually greater than a billion to one. It can happen. But it is hardly likely to happen.¹⁴
Such is the standard error and its use. The standard errors of other statistics are used in the same way. A very important and useful tool, it is a basic instrument in contemporary research. Indeed, it would be hard to imagine modern research methodology, and impossible to imagine modern statistics, without the standard error. As a key to statistical inference its importance cannot be overestimated. Much of statistical inference boils down to a family of fractions epitomized by the fraction:

    statistic / standard error of the statistic
STATISTICAL INFERENCE
To infer is to derive a conclusion from premises or from evidence. To infer statistically is to draw a conclusion about a population from the evidence of a sample. If, say, a sample of voters responds to a question in a certain way, it is inferred that the whole population of the United States, if asked, will respond similarly. This is a rather big inference. One of the gravest dangers of research — or, perhaps I should say, of any human reasoning — is the inferential leap from sample data to population fact.
It can be said, in sum, that statistics enables scientists to test substantive hypotheses indirectly by enabling them to test statistical hypotheses directly (if it is at all possible to test anything directly). In this process, they use null hypotheses, hypotheses of chance. They test the "truth" of substantive hypotheses by subjecting null hypotheses to statistical tests on the basis of probabilistic reasoning. They then make appropriate inferences.¹⁵ Indeed, the objective of all statistical tests is to test the justifiability of inferences.
¹⁴An important question is: How large a difference — or, in the language of statistics, how far away from the hypothetical mean of zero — must a deviation be to be significant? This question cannot be answered definitively in this book. The .05 level is 1.96 standard deviations from the mean, and the .01 level is 2.58 standard deviations from the mean. But there are complications, especially with small samples. The student must, as usual, study a good statistics text. A simple rule is: 2 standard deviations (SE's) are significant (about the .05 level); 2.5 standard deviations are very significant (about the .01 level); and 3 standard deviations are highly significant (a little less than the .001 level).
¹⁵A reviewer of this chapter has questioned the message the chapter implies, namely that all statistical tests of hypotheses involve standard errors. This implication would be unfortunate. Indeed, as we will see in later chapters, other means of assessing statistical significance are often used. For example, the nonparametric analysis of variance tests presented in Chapter 16 depend on ranking, and the complex tests of analysis of covariance structures of Chapter 36 depend on comparisons of covariances (correlations).
Study Suggestions
1. Statistics references:
Edwards, A. Statistical Analysis, 3d ed. New York: Holt, Rinehart and Winston, 1969. A good book for the beginning student: clear and readable. See, especially, chaps. 3, 4, 10, and 11.
Freedman, D., Pisani, R., and Purves, R. Statistics. New York: Norton, 1978. Accessible to the beginning student. Good discussions of interesting studies and problems. Applications oriented.
Hays, W. Statistics, 3d ed. New York: Holt, Rinehart and Winston, 1981. Superb: thorough, authoritative, research-oriented — but not elementary. Its careful study should be a goal of serious students and researchers.
McNemar, Q. Psychological Statistics, 4th ed. New York: Wiley, 1969. Excellent book: clear, comprehensive, helpful.
Snedecor, G., and Cochran, W. Statistical Methods, 6th ed. Ames, Iowa: Iowa State University Press, 1967. Solid, authoritative, helpful, but not elementary. Excellent reference book.
2. The proportions of men and women voters in a certain county are .70 and .30, respectively. In one election district of 400 people, there are 300 men and 100 women. Can it be said that the district's proportions of men and women voters differ significantly from those of the county?
[Answer: Yes. χ² = 4.76. χ² table entry, .05 level, for df = 1: 3.84.]
3. An investigator in the field of prejudice experimented with various methods of answering the prejudiced person's remarks about minority group members. He randomly assigned 32 subjects to two groups, 16 in each group. With the first group he used method A; with the second group he used method B. The means of the two groups on an attitude test, administered after the methods were used, were A: 27; B: 25. Each group had a standard deviation of 4. Do the two group means differ significantly?
[Answer: No. (27 − 25)/1.414 = 1.414.]
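The answer can be checked with Equation 12.2. A small sketch (ours, not the book's) using the summary figures of the problem:

    import math

    se_a = 4 / math.sqrt(16)                # SE of the mean, method A group: 1.0
    se_b = 4 / math.sqrt(16)                # method B group: 1.0
    se_d = math.sqrt(se_a**2 + se_b**2)     # sqrt(2) = 1.414
    t = (27 - 25) / se_d                    # 1.414: well under 2 SEs, not significant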
4. The evenly distributed 4000 random numbers discussed in the text and the statistics calculated from the random numbers are given in Appendix C at the end of the book. Use a table of random numbers — the 4000 random numbers will do — and wave a pencil in the air with eyes closed and let it come to rest at any point in the table. Going down the columns from the place the table was entered, copy out 10 numbers in the range from 1 through 40. Let these be the numbers of ten of the 40 groups. The means, variances, and standard deviations are given right after the table of 4000 random numbers. Copy out the means of the groups randomly selected. Round the means; i.e., 54.33 becomes 54, 47.87 becomes 48, and so on.
(a) Calculate the mean of the means, and compare it to the population mean of 50 (really 50.33). Did you come close?
(b) Calculate the standard deviation of the 10 means.
(c) Take the first group selected and calculate the standard error of the mean, using N = 100 and the reported standard deviation. Do the same for the fourth and ninth groups. Are the SE_M's alike? Interpret the first SE_M. Compare the results of (b) and (c).
(d) Calculate the differences between the first and sixth means and the fourth and tenth means. Test the two differences for statistical significance. Should they be statistically significant? Give the reason for your answer. Make up an experimental situation and imagine that the fourth and tenth means are your results. Interpret.
(e) Discuss the central limit theorem in relation to (d), above.
5. To now, the variance and the standard deviation have been calculated with N in the denominator. In statistics books, the student will encounter the variance formula as: V = Σx²/N, or V = Σx²/(N − 1). The first formula is used when only describing a sample or population. The second is used when estimating the variance of a population with the sample variance (or standard deviation). With N large, there is little practical difference. In Part Five, we will see that the denominators of variance estimates always have N − 1, k − 1, and so on. These are really degrees of freedom. Most computer programs use N − 1 to calculate standard deviations. Perhaps the best advice is to use N − 1 always. Even when it is not appropriate, it will not make that much difference.
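The practical difference between the two denominators is easy to see by computation. A sketch (ours, not the book's), using a small set of scores whose deviation sum of squares is 10:

    def variance(scores, estimate=True):
        n = len(scores)
        m = sum(scores) / n
        ss = sum((x - m) ** 2 for x in scores)
        # ss/N describes the set itself; ss/(N - 1) estimates a population variance
        return ss / (n - 1) if estimate else ss / n

    scores = [6, 5, 4, 3, 2]
    print(variance(scores, estimate=False))   # 10/5 = 2.0
    print(variance(scores, estimate=True))    # 10/4 = 2.5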
6. Statistics is not always viewed favorably. Marxists, for example, are not too sympathetic. (Why, do you suppose?) Here is an interesting study in education in which a design with a control group was used, but no statistical tests of significance or measures of the magnitude of relations were used!: E. DeCorte and L. Verschaffel, "Children's Solution Processes in Elementary Arithmetic Problems: Analysis and Improvement," Journal of Educational Psychology, 73 (1981), 765–
PART FIVE
ANALYSIS
OF VARIANCE
Chapter 13
Analysis of Variance:
Foundations
The general linear model can be written: y = b1x1 + b2x2 + ... + bkxk + e. (Notice that none of the x's has a power greater than 1; that is, there are no x²'s or x³'s.) If we conceive a score of one individual, y, as having one or more sources of variance, x1, x2, . . . , then we can roughly grasp the idea of the model.¹ The b's are weights that express the relative degrees of influence of the x's in accounting for y; e is error.
¹For a lucid, brief explanation of the general linear model and analysis of variance models, see W. Hays, Statistics, 3d ed. New York: Holt, Rinehart and Winston, 1981, pp. 327ff.
and research problems. To accomplish the pedagogical purpose, simple examples will be
used. It makes little difference whether 5 scores or 500 scores are used, or if 2 or 20
variables are used. The fundamental ideas, the theoretical conceptions, are the same. In
this chapter, simple one-way analysis of variance is discussed. The next two chapters
consider so-called factorial analysis of variance and the analysis of variance of correlated
groups or subjects. By then the student should have a good basis for the study of research
design.
groups), we use k − 1. While this method has a great advantage from a statistical point of
Table 13.1 Two Sets of Hypothetical Experimental Data with Sums, Means, and Sums of Squares
separately and then averaging these separate variances. We do this using the figures given in Table 13.1. Each group has Σx² = 10. Dividing each of these sums of squares by its degrees of freedom, n − 1 = 4, yields a variance of 2.5 for each group.
statistical significance, whereas the t test applies only to two groups. (With two groups, as
we shall see shortly, the results of the two methods are really identical.) The method of
analysis of variance uses variances entirely, instead of using actual differences and stand-
ard errors, even though the actual difference-standard error reasoning is behind the
method. Two variances are always pitted against each other. One variance, that presuma-
bly due to the experimental (independent) variable or variables, is pitted against another
variance, that presumably due to error or randomness. This is a case, again, of informa-
tion versus error, as Diamond put it,* or, as information theorists say, information versus
noise. To get a grip on this idea, go back to the problem.
We found that the between-groups variance was .50. Now we must find a variance that
is a reflection of error. This is the within-groups variance. After all, since we calculate the
within-groups variance, essentially, by calculating the variance of each group separately
and then averaging the two (or more) variances, this estimate of error is unaffected by the
differences between the means. Thus, if nothing else is causing the scores to vary, it is
reasonable to consider the within-groups variance a measure of chance fluctuation. If this
is so, then we can stack up the variance due to the experimental effect, the between-
groups variance, against this measure of chance error, the within-groups variance. The
only question is: How is the within-groups variance calculated?
Remember that the variance of a population of means can be estimated with the
standard variance of the mean (the standard error squared). One way to obtain the within-
groups variance is to calculate the standard variance of each of the groups and then
average them for all of the groups. This should yield an estimate of error that can be used
to evaluate the variance of the means of the groups. The reasoning here is basic. To evaluate the differences between the means, it is necessary to refer to a theoretical population of means that would be gotten from the random sampling of groups of scores like the groups of scores we have. In the present case, we have two means from samples with five scores in each group. (It is well to remember that we might have three, four, or more means from three, four, or more groups. The reasoning is the same.) If subjects were assigned to the groups at random, and nothing has operated — that is, there have been no experimental manipulations and no other systematic influences at work — then it is possible to estimate the variance of the means of the population of means with the standard variance of the means (SE_M², or SV_M). Each group provides such an estimate. These estimates will vary to some extent among themselves. We can pool them by averaging to form an overall estimate of the variance of the population means.
Recall that the formula for the standard error of the mean was: SE_M = SD/√n. Simply square this expression to get the standard variance of the mean: SE_M² = SD²/n = SV_M = V/n. The variance of each of the groups was 2.5. Calculating the standard variances, we obtain for each group: SV_M = V/n = 2.50/5 = .50. Averaging them obviously yields .50.
Note carefully that each standard variance was calculated from each group separately and
then averaged. Therefore this average standard variance is uninfluenced by differences
between the means, as noted earlier. The average standard variance, then, is a within-
groups variance. It is an estimate of random errors.
But if random numbers had been used, the same reasoning applies to the between-
groups variance, the variance calculated from the actual means. We calculated a variance
from the means of 4 and 3: it was .50. If the numbers were random, estimating the
variance of the population of means should be possible by calculating the variance of the
obtained means.
Note carefully, however, that if any extraneous influence has been at work, if any-
*S. Diamond, Information and Error. New York: Basic Books, 1959.
thing like experimental effects have operated, then no longer will the variance calculated from the obtained means be a good estimate of the population variance of means. If an experimental influence — or some influence other than chance — has operated, the effect may be to increase the variance of the obtained means. In a sense, this is the purpose of experimental manipulation: to increase the variance between means, to make the means different from each other. This is the crux of the analysis of variance matter. If an experimental manipulation has been influential, then it should show up in differences between means above and beyond the differences that arise by chance alone. And the between-groups variance should show the influence by becoming greater than expected by chance. Clearly we can use V_b, then, as a measure of experimental influence. Equally clearly, as we showed above, we can use V_w as a measure of chance variation. Therefore, we have almost reached the end of a rather long but profitable journey: we can evaluate the between-groups variance, V_b, with the within-groups variance, V_w. Or information, experimental information, can be weighed against error or chance.
It might be possible to evaluate V_b by subtracting V_w from it. In the analysis of variance, however, V_b is divided by V_w. The ratio so formed is called the F ratio. (The F ratio was named by Snedecor in honor of Ronald Fisher, the inventor of the analysis of variance. It was Snedecor who worked out the F tables used to evaluate F ratios.) One calculates the F ratio from observed data and checks the result against an F table. (The F table with directions for its use can be found in any statistics text.) If the obtained F ratio is as great or greater than the appropriate tabled entry, the differences that V_b reflects are statistically significant. In such a case the null hypothesis of no differences between the means is rejected at the chosen level of significance. In the present case:

    F = V_b/V_w = .50/.50 = 1
One obviously does not need the F table to see that the F ratio is not significant. Evidently the two means of 4 and 3 do not differ from each other significantly. In other words, of the many possible random samples of pairs of groups of five cases each, this particular case could easily be one of them. Had the difference been considerably greater, great enough to tip the F-ratio balance scale, then the conclusion would have been quite different, as we shall see.⁵
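The whole chain of reasoning can be traced in a few lines: within-groups variance as error, between-groups variance as information, and F as their ratio. A sketch (ours, not the book's), with hypothetical scores consistent with the statistics reported for Table 13.1 (means 4 and 3, each group with Σx² = 10 and n = 5):

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    a1 = [6, 5, 4, 3, 2]        # mean 4, variance 2.5 (hypothetical scores)
    a2 = [5, 4, 3, 2, 1]        # mean 3, variance 2.5
    n = 5

    sv1 = var(a1) / n           # standard variance of the mean, group 1: .50
    sv2 = var(a2) / n           # group 2: .50
    v_w = (sv1 + sv2) / 2       # within-groups estimate of chance variation: .50
    v_b = var([4, 3])           # variance of the two obtained means: .50
    print(v_b / v_w)            # F = .50/.50 = 1.0: clearly not significant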
Suppose that the investigator had obtained quite different results. Say the means had been 6 and 3, rather than 4 and 3. We now take the above example and add a constant of 2 to each A1 score. This operation of course merely restores the scores used in Chapter 6. It was said earlier that adding a constant to a set of scores (or subtracting a constant) changes the mean by the constant but has no effect whatever on the variance. The figures are given in Table 13.5.
⁵Note that the t test and analysis of variance yielded the same result. With only two groups, or one degree of freedom (k − 1), F = t², or t = √F. This equality shows that it does not matter, in the case of two groups, whether t or F is calculated. (But the analysis of variance is a bit easier to calculate than t, in most cases.) With three or more groups, however, the equality breaks down; F must always be calculated. Thus F is the general test of which t is a special case.
Table 13.5 Hypothetical Experimental Data for Two Groups: Table 13.1 Data Altered
seems that the difference of 3 is a statistically significant difference at the .05 level. Therefore, 6 ≠ 3, and the null hypothesis is rejected.
⁶To do the calculations in this book, students are urged to use only desk or hand-held calculators that can cumulate sums and sums of squares. Occasionally access to a microcomputer or larger computer will be helpful, even necessary. In general, however, one should get the feel of analysis of variance (and other methods) "through the hands." Students are strongly advised against using package programs (SPSS, BMDP, SAS, and the like) to do the exercises, except when explicitly advised to do so. Some calculations, of course, cannot be done "by hand," or are extremely difficult to do. The principle is (or should be): Understand what you're doing all along the way. To relieve the monotony of repetitive calculations of sums and sums of squares, it is highly useful if one has a programmable calculator or a microcomputer and an easily used program to calculate sums and sums of squares.
ratio by dividing the within or error variance or mean square into the between variance or mean square: F = ms_b/ms_w = 22.50/2.50 = 9. This final F ratio, also called the variance ratio, is checked against appropriate entries in an F table to determine its significance, as discussed previously.
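The sums-of-squares route can be mechanized the same way. A sketch (ours, not the book's), using the altered scores (the hypothetical A1 scores of the earlier example with the constant of 2 added); it reproduces the mean squares just cited:

    a1 = [8, 7, 6, 5, 4]                  # mean 6 (hypothetical scores, constant added)
    a2 = [5, 4, 3, 2, 1]                  # mean 3
    scores = a1 + a2
    N, k = len(scores), 2

    C = sum(scores) ** 2 / N                          # correction term: 45^2/10 = 202.5
    ss_t = sum(x * x for x in scores) - C             # total: 245 - 202.5 = 42.5
    ss_b = sum(a1) ** 2 / 5 + sum(a2) ** 2 / 5 - C    # between: 225 - 202.5 = 22.5
    ss_w = ss_t - ss_b                                # within: 20.0
    F = (ss_b / (k - 1)) / (ss_w / (N - k))           # 22.5 / 2.5 = 9.0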
A RESEARCH EXAMPLE

To illustrate the research use of one-way analysis of variance, data from an early experimental study by Hurlock, described earlier in this book, are given in Table 13.8.⁷ The data were not analyzed in this manner by Hurlock, the analysis of variance not being available at the time of the study. Hurlock divided 106 fourth- and sixth-grade pupils into four groups, E1, E2, E3, and C. Five forms of an addition test, A, B, C, D, and E, were used. Form A was administered to all the Ss on the first day. For the next four days the experimental groups, E1, E2, and E3, were given a different form of the test. The control group, C, was separated from the other groups and given different forms of the test on four separate days. The Ss of Group C were told to work as usual. But each day before the tests were given, the E1 group was brought to the front of the room and praised for its good work. Then the E2 group was brought forward and reproved for its poor work. The members of the E3 group were ignored. On the fifth day of the experiment, Form E was administered to all groups. Scores were the number of correct answers on this form of the test. Summary data are given in Table 13.8, together with the table of the final analysis of variance.

Since F = 10.08, which is significant at the .001 level, the null hypothesis of no differences between the means has to be rejected. Evidently the experimental manipulations were effective. There is not much difference between the Ignored and Control groups, an interesting finding. The Praised group has the largest mean, with the Reproved group mean in between the Praised group and the other two groups. The student can complete the interpretation of the data.⁸
Table 13.8 Summary Data and Analysis of Variance of Data from Hurlock Study
of the table reflect the dependent variable: arithmetic achievement in the Hurlock example. The analysis of variance works with the relation between these two kinds of variables. If the independent variable has had an effect on the dependent variable, then the "equality" of the means of the experimental groups that would be expected if the numbers being analyzed were random numbers is upset. The effect of a really influential independent variable is to make means unequal. We can say, then, that any relation that exists between the independent and dependent variables is reflected in the inequality of the means. The more unequal the means, the wider apart they are, the higher the relation, other things equal.
If no relation exists between the independent variable and the dependent variable, then it is as though we had sets of random numbers, and consequently, random means. The differences between the means would only be chance fluctuations. An F test would show them not to be significantly different. If a relation does exist, if there is a tie or bond
Table 13.9 Hypothetical Data: Strong Relation between Methods and Achievement

    Method A1:  10   9   9   8     M = 9.00
    Method A2:   7   7   7   7     M = 7.00
    Method A3:   5   4   4   3     M = 4.00
Table 13.10 The Scores of Table 13.9 Randomly Reassigned to Methods

    Method A1:  10   9   7   3     M = 7.25
    Method A2:   8   5   4   4     M = 5.25
    Method A3:   9   7   7   7     M = 7.50
between the independent and dependent variables, the imposition of different aspects of the independent variable, like different methods of instruction, should make the measures of the dependent variable vary accordingly. Method A1 might make achievement scores go up, whereas method A2 might make them go down or stay about the same. Note that we have the same phenomenon of concomitant variation that we did with the correlation coefficient. Take two extreme cases: a strong relation and a zero relation. We lay out a hypothetically strong relation between methods and achievement in Table 13.9. Note that the dependent variable scores vary directly with the independent variable methods: Method A1 has high scores, method A2 medium scores, and method A3 low scores. The relation is also shown by comparing methods and the means of the dependent variable.

Compare the example of Table 13.9 with chance expectation. If there were no relation between methods and achievement, then the achievement means would not covary with methods. That is, the means would be nearly equal. In order to show this, I wrote the 12 achievement scores of Table 13.9 on separate slips of paper, mixed them up thoroughly in a hat, threw them all on the floor, and picked them up 4 at a time, assigning the first four to A1, the second four to A2, and the third four to A3. The results are shown in Table 13.10.
Now it is difficult, or impossible, to "see" a relation. The means differ, but not much. Certainly the relation between methods and achievement scores (and means) is not nearly as clear as it was before. Still, we have to be sure. Analyses of variance of both sets of data were performed. The F ratio of the data of Table 13.9 (strong relation) was 57.59, highly significant, whereas the F ratio of the data of Table 13.10 (low or zero relation) was 1.29, not significant. The statistical tests confirm our visual impressions. We now know that there is a relation between methods and achievement in Table 13.9 but not in Table 13.10.
The problem, however, is to show the relation between significance tests like the F test and the correlation method. This can be done in several ways. We illustrate with two such ways, one graphical and one statistical. In Figure 13.1 the data of Tables 13.9 and 13.10 have been plotted much as continuous X and Y measures in the usual correlation problem are plotted, with the independent variable, Methods, on the horizontal axis, and the dependent variable, Achievement, on the vertical axis, as usual. To indicate the
Figure 13.1 Achievement plotted against Methods: Table 13.9 data (left) and Table 13.10 data (right).
relation, lines have been drawn as near to the means as possible. A diagonal line making a 45-degree angle with the horizontal axis would indicate a strong relation. A horizontal line across the graph would indicate a zero relation. Note that the plotted scores of the data of Table 13.9 clearly indicate a strong relation: the height of the plotted scores (crosses) and the means (circles) varies with the method. The plot of the data of Table 13.10, even with a rearrangement of the methods for purposes of comparison, shows a weak relation or no relation.
Let us now look at the problem statistically. It is possible to calculate correlation coefficients with data of this kind. If one has done an analysis of variance, a simple (but not entirely satisfactory) coefficient is yielded by the following formula:

    η = √(ss_b/ss_t)     (13.4)

Of course, ss_b and ss_t are the between-groups sum of squares and the total sum of squares, respectively. One simply takes these sums of squares from the analysis-of-variance table to calculate the coefficient. η, usually called the correlation ratio, is a general coefficient or index of relation often used with data that are not linear. (Linear, roughly speaking, means that, if two variables are plotted one against another, the plot tends to follow a straight line. This is another way of saying what was said in Chapter 12 about linear combinations.) Its values vary from 0 to 1.00. We are interested here only in its use with analysis of variance and in its power to tell us the magnitude of the relation between independent and dependent variables.
Recall that the means of the data of Table 13.1 were 3 and 4. They were not significantly different. Therefore there is no relation between the independent variable (methods) and the dependent variable (achievement). If an analysis of variance of the data of Table 13.1 is done, using the method outlined in Table 13.7, ss_b = 2.50 and ss_t = 22.50. η = √(2.50/22.50) = √.111 = .33 yields the correlation between methods and achievement. Since we know that the data are not significant (F = 1), η is not significant. In other words, η = .33 is here tantamount to a zero relation. Had there been no difference at all between the means, then, of course, η = 0. If ss_b = ss_t, then η = 1.00. This can happen only if all the scores of one group are the same, and all the scores of the other group are the same as, and yet different from, those of the first group, which is highly unlikely. For example, if the A1 scores were 4, 4, 4, 4, 4, and the A2 scores were 3, 3, 3, 3, 3, then ss_b = ss_t = 2.5, and η = √(2.5/2.5) = 1. It is obvious that there is no within-groups variance — again, extremely unlikely.
Take the data of Table 13.7. The means are 6 and 3. They are significantly different, since F = 9. Calculate η:

    η = √(ss_b/ss_t) = √(22.50/42.50) = √.529 = .73

Note the substantial increase in η. And since F is significant, η = .73 is significant. There is a substantial relation between methods and achievement.
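Formula 13.4 is a one-liner. A sketch (ours, not the book's) reproducing both coefficients from the sums of squares given in the text:

    import math

    def eta(ss_b, ss_t):
        # Equation 13.4: the correlation ratio
        return math.sqrt(ss_b / ss_t)

    print(round(eta(2.50, 22.50), 2))    # .33 -- Table 13.1 data, F = 1
    print(round(eta(22.50, 42.50), 2))   # .73 -- Table 13.7 data, F = 9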
A more conservative index is ω² (omega squared), calculated by the formula:

    ω² = (ss_b − (k − 1)ms_w) / (ss_t + ms_w)     (13.5)

where k = number of groups in the analysis of variance and the other terms are the sums of squares and mean squares defined earlier. ω² is a conservative estimate of the strength of association or relation between the independent variable X and the dependent variable Y, or between the variable reflected by the experimental treatment and the dependent variable measure. Calculating ω² for the Hurlock example,

    ω² = (1260.06 − (4 − 1)(41.66)) / (5509.35 + 41.66) = .205
This is rather close to the value of η², .23. ω² is comparable to η² rather than to η. Both indices indicate the proportion of variance in a dependent variable due to the presumed influence of an independent variable.¹²
¹². . . however, if it is used. Actually, the three coefficients, η², R_I, and ω², estimate the same thing: the proportion of variance of the dependent variable accounted for by the independent variable. Here is the formula for R_I, the intraclass coefficient:

    R_I = (ms_b − ms_w) / (ms_b + (n_i − 1)ms_w)

For the Hurlock data, R_I = .26, somewhat larger than ω² and η². To be analogous to product-moment correlation coefficients, strictly speaking, one calculates η and the square roots of R_I and ω². This is not usually done, however.
The point of the above discussion has been to bring out the similarity of conception of these and other indices of association or correlation and, more important, the similarity of the principle and structure of analysis of variance and correlation methods. From a practical and applied standpoint, it should be emphasized that η², ω², or other measures of association should always be calculated and reported. It is not enough to report F ratios and whether they are statistically significant. We must know how strong relations are. After all, with large enough N's, F and t ratios can almost always be statistically significant. While often sobering in their effect, especially when they are low, coefficients of association of independent and dependent variables are indispensable parts of research results.
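To underline the advice, here is Formula 13.5 applied to the Hurlock figures reported above; a sketch (ours, not the book's):

    def omega_sq(ss_b, ss_t, ms_w, k):
        # Equation 13.5: a conservative estimate of strength of association
        return (ss_b - (k - 1) * ms_w) / (ss_t + ms_w)

    # Hurlock summary statistics from Table 13.8
    print(omega_sq(1260.06, 5509.35, 41.66, 4))   # 0.2045, the .205 reported in the text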
Post Hoc Tests

Suppose an experiment like Hurlock's has been done and the experimenter has the data of Table 13.8. He knows that the overall differences among the means are statistically significant. But he does not know which differences contribute to the significance. Can he simply test the differences between all pairs of means to tell which are significant? Yes and No, but generally No. Such tests are not independent and, with sufficient numbers of tests, one of them can be significant by chance. In short, such a "shotgun" procedure capitalizes on chance. Moreover, such tests are blind and what has been called "no-headed."
There are several ways to do post hoc tests, but we mention only one of them briefly.¹³ The Scheffé test, if used with discretion, is a general method that can be applied to all comparisons of means after an analysis of variance.¹⁴ If and only if the F test is significant, one can test all the differences between means; one can test the combined mean of two or more groups against the mean of one other group; or one can select any combination of means against any other combination. Such a test with the ability to do so much is very useful. But we pay for the generality and usefulness: the test is quite conservative. To attain significance, differences have to be rather substantial. The main point is that post hoc comparisons and tests of means can be done, mainly for exploratory and interpretative purposes. One examines one's data in detail: one rummages for insights and clues.
Since it would take us too far afield, the mechanics of the Scheffé test are not given
"For an excellent description of such tests, see T. Ryan, "Multiple Comparisons in Psychological Re-
search." Psychological Bulletin, 56 (1959), 26-47. See, also, T. Ryan, "Significance Tests for Multiple
Comparisons of Proportions, Variances, and Other Statistics." Psychological Bulletin, 57 (1960), 318-328.
Perhaps the best reference is R. Kirk, Experimental Design: Procedures for the Behavioral Sciences. Belmont,
Calif; Brooks/Cole. 1968. chap. 3.
"H. Scheffe, "A Method for Judging All Contrasts in the Analysis of Variance." Biometrika, (1953).
87-104. See, also, Hays, op, cit., p. 433ff. Hays' discussion of the problems of this section is particularly good.
here. (But see Study Suggestion 6 at the end of the chapter.) Suffice it to say that, when applied to the Hurlock data of Table 13.8, it shows that the Praised mean is significantly greater than the other three means and that none of the other differences is significant. This is important information, because it points directly to the main source of the significance of the overall F ratio: praise versus reproof, ignoring, and control. (However, the difference between an average of means 1 and 2 versus an average of means 3 and 4 is also statistically significant.) Although one can see this from the relative sizes of the means, the Scheffé test makes things precise — in a conservative way.
Planned Comparisons
While post hoc tests are important in actual research, especially for exploring one's data
and for getting leads for future research, the method of planned comparisons is perhaps
more important scientifically. Whenever hypotheses are formulated and systematically
tested and empirical results support them, this is much more powerful evidence on the
empirical validity of the hypotheses than when "interesting" (sometimes translate: "support my predilections") results are found after the data are obtained. This point was made
in Chapter 2 where the power of hypotheses was explained.
In the analysis of variance, an overall F test, if significant, simply indicates that there are significant differences somewhere in the data. Inspection of the means can tell one, though imprecisely, which differences are important. To test hypotheses, however, more or less controlled and precise statistical tests are needed. There is a large variety of possible comparisons in any set of data that one can test. But which ones? As usual, the research problem and the theory behind the problem should dictate the statistical tests. One designs research in part to test substantive hypotheses.
Suppose the reinforcement theory behind the Hurlock study said, in effect, that any kind of attention, positive or negative, will improve performance, and that positive reinforcement will improve it more than punishment. This would mean that E1 and E2 of Table 13.8, taken together or separately, will be significantly greater than E3 and C taken together or separately. That is, both Praised (positive reinforcement) and Reproved (punishment) will be significantly greater than Ignored (no reinforcement) and Control (no reinforcement). In addition, the theory says that the effect of positive reinforcement is greater than the effect of punishment. Thus Praised will be significantly greater than Reproved. These implied tests can be written symbolically:

    H1: (M1 + M2)/2 > (M3 + M4)/2
    H2: M1 > M2
The statistical tests have been radically changed; that is, the plan and design of the research have changed under the impact of the theory and the research problem.
When the Scheffé test is used, the overall F ratio must be significant because none of the Scheffé tests can be significant if the overall F is not significant. When planned comparisons are used, however, no overall F test need be made. The focus is on the planned comparisons and the hypotheses. The number of comparisons and tests made is limited by the degrees of freedom. In the Hurlock example, there are three degrees of freedom (k − 1); therefore, three tests can be made. These tests have to be orthogonal to each other — that is, they must be independent. We keep the comparisons orthogonal by
using what are called orthogonal coefficients, which are weights to be attached to the means in the comparisons. The coefficients, in other words, specify the comparisons. The coefficients or weights for H1 and H2, above, are:

    H1:  1/2   1/2   −1/2   −1/2
    H2:   1    −1      0      0

For comparisons to be orthogonal, two conditions must be met: the sum of each set of weights must equal 0, and the sum of the products of any two sets of weights must also be zero. It is obvious that both of the above sets sum to zero. Test the sum of the products: (1/2)(1) + (1/2)(−1) + (−1/2)(0) + (−1/2)(0) = 0. Thus the two sets of weights are orthogonal.
It is important to understand orthogonal weights, as well as the two conditions just given. The first set of weights simply stands for: (M1 + M2)/2 − (M3 + M4)/2. The second set stands for: M1 − M2. Now, suppose we also wanted to test the notion that the Ignored mean is greater than the Control mean. This is tested by: M3 − M4, and is coded: H3: 0 0 1 −1. Henceforth, we will call these weight vectors. The values of the vector sum to zero. What about its sum of products with the other two vectors?

    H1 × H3: (1/2)(0) + (1/2)(0) + (−1/2)(1) + (−1/2)(−1) = 0
    H2 × H3: (1)(0) + (−1)(0) + (0)(1) + (0)(−1) = 0

The third vector is orthogonal to, or independent of, the other two vectors. The third comparison can be made. If these three comparisons are made, no other is possible because the available k − 1 = 4 − 1 = 3 degrees of freedom are used up.
Suppose, now, that instead of the H3, above, we wanted to test the difference between the average of the first three means against the fourth mean. The coding is: 1/3 1/3 1/3 −1. This is tantamount to: (M1 + M2 + M3)/3 − M4. Is the vector orthogonal to the first vector? Test the sum of the products: (1/2)(1/3) + (1/2)(1/3) + (−1/2)(1/3) + (−1/2)(−1) = 2/3. Since the sum of the products does not equal zero, it is not orthogonal to the first vector, and the comparison should not be made. The comparison implied by the vector would yield redundant information. In this case, the comparison using the third vector supplies information already given in part by the first vector.
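The two conditions are mechanical and easily checked in code. A sketch (ours, not the book's) testing the vectors just discussed:

    def products_sum(v, w):
        # A pair of comparison vectors is orthogonal if this sum is zero
        return sum(a * b for a, b in zip(v, w))

    h1 = [1/2, 1/2, -1/2, -1/2]    # (M1 + M2)/2 - (M3 + M4)/2
    h2 = [1, -1, 0, 0]             # M1 - M2
    h3 = [0, 0, 1, -1]             # M3 - M4
    h4 = [1/3, 1/3, 1/3, -1]       # (M1 + M2 + M3)/3 - M4

    print(products_sum(h1, h2), products_sum(h1, h3), products_sum(h2, h3))  # all zero
    print(products_sum(h1, h4))    # 2/3: not orthogonal, the comparison is redundant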
The method of calculating the significance of the differences of planned comparisons need not be detailed. Besides, at this point we do not need the actual calculations. Our purpose, we hope, is a larger one: to show the flexibility and power of analysis of variance when properly conceived and understood. F tests (or t tests) are used with each comparison, or, in this case, with each degree of freedom. The details of calculations can be found in Hays' and other texts.¹⁵ The basic idea of planned comparisons is quite general, and we
¹⁵Hays, op. cit., chap. 12. In the Hurlock example, H1 and H2, above, were both statistically significant at
a time and gives us a powerful lever for solving measurement problems. It increases the possibilities of making experiments exact and precise. It also permits us to test several hypotheses at one time, as well as to test hypotheses that cannot be tested in any other way, at least with precision. Thus its generality of application is great.

More germane to the purposes of this book, the analysis of variance gives us insight into modern research approaches and methods. It does this by focusing sharply and constantly on variance thinking, by making clear the close relation between research problems and statistical methods and inference, and by clarifying the structure, the architecture, of research design. It is also an important step in understanding contemporary multivariate conceptions of research because it is an expression of the general linear model.
The model of this chapter is simple and can be written:

    y = a0 + A + e
Study Suggestions
1. There are many good references on analysis of variance, varying in difficulty and clarity of explanation. Hays' discussion (op. cit., pp. 325–348), which includes the general linear model, is as usual excellent, but not easy. It is highly recommended for careful study. The following two books are very good indeed. Both are staples of statistical diet.
Edwards, A. Experimental Design in Psychological Research, 4th ed. New York: Holt, Rinehart and Winston, 1972.
Kirk, R. Experimental Design: Procedures for the Behavioral Sciences. Belmont, Calif.: Brooks/Cole, 1968.
Some students may like to read an interesting history of analysis of variance, especially in psychology, followed by a history of the .05 level of statistical significance.
Rucci, A., and Tweney, R. "Analysis of Variance and the 'Second Discipline' of Scientific Psychology: A Historical Account," Psychological Bulletin, 87 (1980), 166–184.
Cowles, M., and Davis, C. "On the Origins of the .05 Level of Statistical Significance," Psychological Bulletin, 89 (1982), 553–558.
2. A university professor has conducted an experiment to test the relative efficacies of three methods of instruction: A1, Lecture; A2, Large-Group Discussion; and A3, Small-Group Discussion. From a universe of sophomores, 30 were selected at random and randomly assigned to three groups. The three methods were randomly assigned to the three groups. The students were tested for their achievement at the end of four months of the experiment. The scores for the three groups are given below.
Test the null hypothesis, using one-way analysis of variance and the .01 level of significance. Calculate η² and ω². Interpret the results. Draw a graph of the data similar to those in the text.
of the group to which they then ostensibly belonged. The means and standard deviations of the total ratings are: severe: M = 195.3, SD = 31.9; mild: M = 171.1, SD = 34.0; control: M = 166.7, SD = 21.6. Each n was 21.
(a) Do an analysis of variance of these data. Use the method outlined in the addendum to this chapter. Interpret the data. Is the hypothesis supported?
(b) Calculate ω². Is the relation strong? Would you expect the relation to be strong in an experiment of this kind?
6. Use the Scheffé test to calculate the significance of all the differences between the three means of Study Suggestion 2, above. Here is one way to do the Scheffé test. Calculate the standard error of the differences between two means with the following formula:

    SE_Mi−Mj = √[ms_w(1/n_i + 1/n_j)]     (13.6)

where ms_w = within-groups mean square, and n_i and n_j are the numbers of cases in groups i and j. For the example, calculate this from the ms_w of the Study Suggestion 2 analysis. Then calculate:

    S = √[(k − 1)F]     (13.7)

where k = number of groups in the analysis of variance, and the F term is the .05 level F ratio obtained from an F table at k − 1 (3 − 1 = 2) and N − k = 30 − 3 = 27 degrees of freedom. This is 3.35. Thus, S = √[(2)(3.35)] = 2.59.
The final step is to multiply the results of Formulas 13.6 and 13.7. Any difference, to be statistically significant at the .05 level, must be as large or larger than 2.10. Now use the statistic in the example.
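For students who want to mechanize the routine, a sketch (ours, not the book's) of Formulas 13.6 and 13.7 combined; the within-groups mean square shown is a hypothetical stand-in, so substitute the value from your own analysis of the Study Suggestion 2 data (the tabled F of 3.35 is as given above):

    import math

    def scheffe_criterion(ms_w, n_i, n_j, k, f_table):
        se = math.sqrt(ms_w * (1 / n_i + 1 / n_j))   # Formula 13.6
        s = math.sqrt((k - 1) * f_table)             # Formula 13.7
        return se * s   # a mean difference at least this large is significant at .05

    # ms_w = 3.3 is a hypothetical value for illustration only
    print(scheffe_criterion(ms_w=3.3, n_i=10, n_j=10, k=3, f_table=3.35))  # about 2.10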
7. Studies that have used one-way analysis of variance are relatively infrequent. Here are five of them. Select two for study. Pay particular attention to post hoc tests of the significance of the differences between means.
Allen, D. "Some Effects of Advance Organizers and Level of Questions on the Learning and Retention of Written Social Studies Materials," Journal of Educational Psychology, 56 (1970), 333–339.
Golightly, C., and Byrne, D. "Attitude Statements as Positive and Negative Reinforcements," Science, 146 (1964), 798–799.
Jones, S., and Cook, S. "The Influence of Attitude on Judgments of the Effectiveness of Alternative Social Policies," Journal of Personality and Social Psychology, 32 (1975), 767–773.
Silverstein, B. "Cigarette Smoking, Nicotine Addiction, and Relaxation," Journal of Personality and Social Psychology, 42 (1982), 946–950.
Wittrock, M. "Replacement and Nonreplacement Strategies in Children's Problem Solving," Journal of Educational Psychology, 58 (1967), 69–74.
ADDENDUM
Analysis of Variance Calculations with Means, Standard Deviations, and n's
(1) From the n's and M's calculate the sums of the groups, ΣX_j. Add these to obtain ΣX_t. Calculate total N from the n's of the groups.
(2) Calculate the correction term: C = (ΣX_t)²/N.
(3) From the standard deviations calculate the within-groups sum of squares: ss_w = Σ[SD_j²(n_j − 1)].
(4) Calculate the between-groups sum of squares:

    ss_b = Σ[n_j M_j²] − C
    ss_b = [(6²)(5) + (3²)(5)] − C = 225.00 − 202.50 = 22.50

(5) Set up the analysis of variance table (as in Table 13.7), and calculate mean squares and the F ratio.
Special Note: The method assumes that the original standard deviations were calculated with n − 1. If they were calculated with n, alter step 3, above: (1.4142²)(5) + (1.4142²)(5) = 20. I.e., change 4 to 5, or n − 1 to n.
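The addendum translates directly into code. A sketch (ours, not the book's) carrying out steps (1) through (5) from group means, standard deviations, and n's; as a trial run it is fed the summary statistics of Study Suggestion 5 (assuming, per the Special Note, SDs calculated with n − 1):

    def anova_from_summary(means, sds, ns):
        # Steps (1)-(2): group sums, total N, and the correction term C
        sums = [m * n for m, n in zip(means, ns)]
        N = sum(ns)
        C = sum(sums) ** 2 / N
        # Step (3): within-groups sum of squares from the standard deviations
        ss_w = sum(sd * sd * (n - 1) for sd, n in zip(sds, ns))
        # Step (4): between-groups sum of squares
        ss_b = sum(n * m * m for m, n in zip(means, ns)) - C
        # Step (5): mean squares and the F ratio
        k = len(means)
        return (ss_b / (k - 1)) / (ss_w / (N - k))

    # Severe, mild, and control initiation groups (Study Suggestion 5)
    print(anova_from_summary([195.3, 171.1, 166.7], [31.9, 34.0, 21.6], [21, 21, 21]))
    # F is about 5.7, with 2 and 60 degrees of freedom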
Chapter 14
Factorial Analysis of
Variance
We now study the statistical and design approach that epitomizes the true beginning of the modern behavioral science research outlook. The idea of factorial design and factorial analysis of variance is one of the creative research ideas of the last fifty or more years.
and potent phenomenon. Is prejudice so pervasive and subtle that it can work "the other way"? Do people who believe themselves free of prejudice discriminate positively toward minorities? Is there such a thing, in other words, as "inverse prejudice"? Is some of the hiring of blacks and women by business firms and universities prompted by inverse prejudice — or is it simply good business? Such questions can of course be easily asked. They are not so easily answered — at least scientifically.
Figure 14.1 Amount given (vertical axis, .10 to .50) under low and high threat (horizontal axis), for black and white panhandlers.
Table 14.2 Design and Results (Means) of Smith and Cotten Study: 2 × 2 Factorial
examples later. And we can add more independent variables. The only limitations are practical ones: how to handle so many variables at one time and how to interpret interactions, especially triple and quadruple ones. What we are after, however, are the basic ideas behind factorial designs and models.
One of the most significant and revolutionary developments in modern research design and statistics is the planning and analysis of the simultaneous operation and interaction of two or more variables. Scientists have long known that variables do not act independently. Rather, they often act in concert. The virtue of one method of teaching contrasted with another method of teaching depends on the teachers using the methods. The educational effect of a certain kind of teacher depends, to a large extent, on the kind of pupil being taught. An anxious teacher may be quite effective with anxious pupils but less effective with nonanxious pupils. Different methods of teaching in colleges and universities may depend on the intelligence and personality of both professors and students. In the Dutton and Lake study, the effect of threat depended on the race of the panhandler (see Table 14.1 and Figure 14.1). In the Smith and Cotten study, the interaction was different. The joint effect of the independent variables, vagueness and discontinuity, was cumulative: when both were present, the effect was strongest (see Table 14.2).
Before the invention of analysis of variance and the designs suggested by the method, the traditional conduct of experimental research was to study the effect of one independent variable on one dependent variable. I am not implying, by the way, that this approach is wrong. It is simply limited. Nevertheless, many research questions can be adequately answered using this "one-to-one" approach. Many other research questions cannot: we may find, for instance, that A1 is effective when alone or when coupled with level B2, and that, perhaps, A2 is effective only when coupled with B1.
The implied logic behind this sort of research thinking can be understood better by returning to the conditional statements and thinking of an earlier chapter. Recall that a conditional statement takes the form If p, then q, or If p, then q, under conditions r and s. In logical notation: p → q and p → q | r,s. Schematically, the conditional statement behind the one-way analysis of variance problems of Chapter 13 is the simple statement If p, then q. In the Hurlock study, if certain incentives, then certain achievement. In the Aronson and Mills study (see Study Suggestion 5, Chapter 13), if severity of initiation, then liking for the group.

The conditional statements associated with the research problems of this chapter, however, are more complex and subtle: If p, then q, under conditions r and s, or p → q | r,s, where "|" means "under condition(s)." In the Dutton and Lake study, this would be p → q | r, or If threat, then reverse discrimination, under the condition that the target (the panhandler) is black. While structurally similar, the "cumulative" logic of Smith and Cotten is different: we cannot say "under the condition" because p and r, vagueness and discontinuity, are equal partners and combine to affect achievement.
Interaction is the joint effect of two or more independent variables on a dependent variable. More precisely, interaction means that the operation or influence of one independent variable on a dependent variable depends on the level of another independent variable. This is a rather clumsy way of saying what we said earlier in talking about conditional statements, for example, If p, then q, under condition r. In other words, interaction occurs when an independent variable has different effects on a dependent variable at different levels of another independent variable.
The above definition of interaction encompasses two independent variables. This is
First, we calculate the sums of squares that we would for a simple one-way analysis of variance. There is of course a total sum of squares, calculated from all the scores, using C, the correction term:

    C = (ΣX_t)²/N = (40)²/8 = 1600/8 = 200
Since there are four groups, there is a sum of squares associated with the means of the four groups. Simply conceive of the four groups placed side by side as in one-way analysis of variance, and calculate the sum of squares as in the last chapter. Now, however, we call this the "between all groups" sum of squares to distinguish it from sums of squares to be calculated later.

    Between all groups = (196/2 + 36/2 + 196/2 + 36/2) − 200 = 232 − 200 = 32

This sum of squares is a measure of the variability of all four group means. Therefore, if we subtract this quantity from the total sum of squares, we should obtain the sum of squares due to error, the random fluctuations of the scores within the cells (groups). This is familiar: it is the within-groups sum of squares:

    Within groups = 40 − 32 = 8
To calculate the sum of squares for methods, proceed exactly as with one-way analysis of variance: treat the scores (X's) and sums of scores (ΣX's) of the columns (methods) as though there were no B1 and B2:

        A1   A2
    ΣX: 28   12

    Between methods (A1, A2) = (28)²/4 + (12)²/4 − 200 = 196 + 36 − 200 = 32
Similarly, treat types of motivation (B1 and B2) as though there were no methods:

        B1   B2
         8    8
         6    6
         4    4
         2    2
    ΣX: 20   20
The calculation of the between-types sum of squares is really not necessary. Since the sums (and the means) are the same, the between-types sum of squares is zero:

    Between types (B1, B2) = (20)²/4 + (20)²/4 − 200 = 100 + 100 − 200 = 0
There is another possible source of variance, the variance due to the interaction of the two independent variables. The between-all-groups sum of squares comprises the variability due to the means of the four groups: 7, 3, 7, 3. This sum of squares was 32. If this were not a contrived example, part of this sum of squares would be due to methods, part to types of motivation, and a remaining part left over, which is due to the joint action, or interaction, of methods and types. In many cases it would be relatively small, no greater than chance expectation. In other cases, it would be large enough to be statistically significant: it would exceed chance expectation. In the present problem it is clearly zero, since the between-methods sum of squares was 32, and this is equal to the between-all-groups sum of squares. To complete the computational cycle we calculate:

    Interaction = between all groups − between methods − between types = 32 − 32 − 0 = 0
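The partition just carried out can be verified in a few lines. A sketch (ours, not the book's); the cell scores are hypothetical but consistent with every figure in the text (cell means 7, 3, 7, 3; ΣX = 40; N = 8):

    cells = {('A1', 'B1'): [8, 6], ('A2', 'B1'): [4, 2],
             ('A1', 'B2'): [8, 6], ('A2', 'B2'): [4, 2]}

    scores = [x for cell in cells.values() for x in cell]
    N = len(scores)
    C = sum(scores) ** 2 / N                          # (40)^2 / 8 = 200

    def ss_between(groups):
        # Sum of squares among group sums, corrected by C
        return sum(sum(g) ** 2 / len(g) for g in groups) - C

    ss_t = sum(x * x for x in scores) - C             # total: 240 - 200 = 40
    ss_all = ss_between(cells.values())               # between all groups: 32
    a_cols = [cells['A1', 'B1'] + cells['A1', 'B2'],  # methods: sums 28 and 12
              cells['A2', 'B1'] + cells['A2', 'B2']]
    b_rows = [cells['A1', 'B1'] + cells['A2', 'B1'],  # types: sums 20 and 20
              cells['A1', 'B2'] + cells['A2', 'B2']]
    ss_methods = ss_between(a_cols)                   # 32
    ss_types = ss_between(b_rows)                     # 0
    ss_inter = ss_all - ss_methods - ss_types         # 32 - 32 - 0 = 0
    ss_w = ss_t - ss_all                              # within groups: 8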
In the little table on the right, there is only one variability, that between the four means. In both tables, the variability of the four means is the same, since they both have the same four means: 7, 3, 7, 3. Obviously, there is no variability of the B means in both tables. There
Table 14.5 Means of Tables 14.3 (left) and 14.4 (right)

          A1   A2                  A1   A2
    B1     7    3   5        B1     7    3   5
    B2     7    3   5        B2     3    7   5
           7    3                   5    5
are two differences between the tables, then: the A means and the arrangement of the four means inside the squares. If we analyze the sum of squares of the four means, the between-all-groups sum of squares, we find that B1 and B2 contribute nothing to it in both tables, since there is no variability with 5, 5, the means of B1 and B2. In the table on the right, the A1 and A2 means of 5 and 5 contribute no variability. In the table on the left, however, the A1 and A2 means differ considerably, 7 and 3, and thus they contribute variance.

Assuming for the moment that the means of 7 and 3 differ significantly, we can say that methods of the data of Table 14.3 had an effect irrespective of types of motivation. That is, M_A1 ≠ M_A2, or M_A1 > M_A2. As far as this experiment is concerned, methods differ significantly no matter what the type of motivation. And, obviously, types of motivation had no effect, since M_B1 = M_B2. In Table 14.4, on the other hand, the situation is quite
different. Neither methods nor types of motivation had an effect by themselves. Yet there is variance. The problem is: What is the source of the variance? It is in the interaction of methods and types of motivation.
It is instructive to note, before going further, that interaction can be studied and calculated by a subtractive procedure. In a 2 × 2 design, this procedure is simple. Subtract one mean from another in each row, and then calculate the variance of these differences. Take the fictitious means of Table 14.5. If we subtract the Table 14.3 means, we get 7 − 3 = 4; 7 − 3 = 4. Clearly the mean square is zero. Thus, the interaction is zero. Follow the same procedure for the Table 14.4 means (right-hand side of the table): 7 − 3 = 4; 3 − 7 = −4. If we now treat these two differences as we did means in the last chapter and calculate the sum of squares and the mean square, we will arrive at the interaction sum of squares and the mean square, 32 in each case. The reasoning behind this procedure is simple. If there were no interaction, we would expect the differences between row means to be approximately equal to each other and to the difference between the means at the bottom of the table, the methods means, in this case. Note that this is so for the Table 14.3 means; the bottom-row difference is 4, and so are the differences of each of the rows. The row differences of Table 14.4, however, deviate from the difference
Table 14.6 Final Analysis of Variance Tables; Data of Tables 14.3 and 14.4
some of them might be considerably far from 10. The fundamental statistical question is: Do they differ significantly from 10? The means of combinations of means, too, should hover around 10. For example, in a design like that of the previous example the A1 and A2 means should be approximately 10, and the B1 and B2 means should be approximately 10. In addition, the means of each of the cells, A1B1, A1B2, A2B1, and A2B2, should hover around 10.

Using a table of random numbers, I drew 60 digits, 0 through 9, to fill the six cells of a factorial design. The resulting design has two levels or independent variables, A and B. A is subdivided into A1, A2, and A3; B into B1 and B2. This is called a 3 × 2 factorial design. (The examples of Tables 14.3 and 14.4 are 2 × 2 designs.)
Conceive of A as types of appeal. In a social psychological experiment designed to test hypotheses about the best ways to appeal to prejudiced people to change their attitudes, the question is asked: What kinds of appeal work best to change prejudiced attitudes? Assume that three types of appeal, "Religious," "Fair-Play," and "Democratic," have been tried with unclear results. The investigator suspects that the situation is more complex, that types of appeal interact with the manner in which appeals are made. So she sets up a 3 × 2 factorial design, in which the second variable, B, is divided into B1 and B2, impassioned and calm manner of appeal. That is, the religious appeal is given in an impassioned manner to some subjects and in a calm manner to others, and similarly for the other two types of appeal. We will not explore this research problem further, but simply use it to color the abstract and perhaps skeletal quality of our discussion. Imagine the experiment to have been done with the results given in Table 14.7, which gives the design paradigm and the means of each cell, as well as the means of the two variables, A and B, and the general mean, M_t. These means were calculated from the 60 random numbers drawn earlier.
We hardly need a test of statistical significance to know that these means do not differ significantly. Their total range is 3.9 to 5.6. The mean expectation, of course, is the mean of the numbers 0 through 9, 4.5. The closeness of the means to M_t = 4.45, or to 4.5, is remarkable, even for random sampling. At any rate, if these were the results of an actual experiment, the experimenter would probably be most chagrined. Types of appeal, manner of appeal, and the interaction between them are all not significant.

Notice how many different outcome possibilities other than chance there would be if one or both variables had been effective. The three means of types of appeal, M_A1, M_A2, and M_A3, might have been significantly different while the means of manner, M_B1 and M_B2, were not; and so on through the other combinations.

Table 14.7 Two-way Factorial Design; Means of Groups of Random Numbers 0 through 9
Table 14.8 Means of Table 14.7 Altered Systematically by Adding and Subtracting Constants

                               Type of Appeal
    Manner of Appeal        A1         A2         A3       Manner Means
    B1                      6.1        5.0        2.9         4.67
                         (4.1 + 2)
Evidently the alteration of the scores has had an effect. If we were interpreting the results,
as given in Tables 14.8 and 14.9, we would say that, in and of themselves, neither type of
appeal to the bigot nor the manner of appeal differs. But a religious appeal delivered in an
impassioned manner and a democratic appeal in a calm manner seem to be most effective.
Perhaps a bit more clearly, the democratic appeal in an impassioned manner is relatively
ineffectual, as is the religious appeal in a calm manner. (It is not possible to say much
about the fair-play appeal.)
KINDS OF INTERACTION
Until now, we have said nothing about kinds of interaction of independent variables in their
joint influence on a dependent variable. To leap to the core of the matter of interactions,
let us lay out several sets of means to show the main possibilities. There are, of course,
many possibilities, especially when one includes higher-order interactions. The six exam-
ples in Table 14.10 indicate the main possibilities with two independent variables. The
first three setups show the three possibilities of significant main effects. They are so
obvious that they need not be discussed. (There is, naturally, another possibility: neither A
nor B is significant.)
When there is a significant interaction, on the other hand, the situation is not so
obvious. The setups (d), (e), and (f) show three common possibilities. In (d), the means crisscross, as indicated by the arrows in the table. It can be said that A is effective in one direction at B1, but is effective in the other direction at B2. Or, A1 > A2 at B1, but A1 < A2 at B2. This sort of interaction, with its crisscross pattern, is called disordinal interaction (see below). In this chapter, the fictitious example of Table 14.4 was a disordinal interaction. (See also Table 14.5.) The fictitious example of Table 14.8, where interaction was deliberately induced by adding and subtracting constants, is another disordinal interaction.

The setups in (e) and (f), however, are different. Here one independent variable is effective at one level only of the other independent variable. In (e), A1 > A2 at B1, but A1 = A2 at B2. In (f), A1 = A2 at B1, but A1 > A2 at B2. The interpretation changes accordingly. In the case of (e), we would say that A is effective at the B1 level, but makes no difference at the B2 level. The case of (f) would take a similar interpretation. Such interactions are called ordinal interactions.
Table 14.10 Various Sets of Means Showing Different Kinds of Main Effects and Interaction
Figure 14.3 Graphs of the relations for setups (a), (c), (d), and (e) of Table 14.10. Panel (d): interaction significant (disordinal). Panel (e): interaction significant (ordinal). (Vertical axes show dependent-variable values from 0 to 40.)
A simple way to study the interaction in a 2 × 2 setup (it is more complex with more complex models) is to subtract one entry from another in each row, as we did earlier. If this be done for (a), we get, for rows B1 and B2: 10 and 10. For (b), we get 0 and 0, and for (c), 10 and 10 again. When these two differences are equal, as in these cases, there is no interaction. But now try it with (d), (e), and (f). We get 10 and −10 for (d), 10 and 0 for (e), and 0 and 10 for (f). When these differences are significantly unequal, interaction is present. The student can interpret these differences as an exercise.
It is also possible, and often very profitable, to graph interactions, as we did in Figure 14.1. Set up one independent variable by placing the experimental groups (A1, A2, and so on) at equal intervals on the horizontal axis and appropriate values of the dependent variable on the vertical axis. Then plot, against the horizontal axis group positions (A1, A2, and so on), the mean values in the table at the levels of the other independent variable (B1, B2, and so on). This method can quite easily be used with 2 × 3, 3 × 3, and other such designs. The plots of (a), (c), (d), and (e) are given in Figure 14.3.
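The graphing routine is equally mechanical. A small Python sketch (matplotlib assumed; the cell means are made up, chosen only to mimic the disordinal and ordinal patterns of Figure 14.3):

    import matplotlib.pyplot as plt

    # Hypothetical cell means at A1 and A2 for each level of B.
    disordinal = {"B1": [30, 10], "B2": [10, 30]}   # crisscross, as in (d)
    ordinal    = {"B1": [30, 10], "B2": [10, 10]}   # effect at B1 only, as in (e)

    fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
    for ax, (title, cells) in zip(axes, (("disordinal", disordinal),
                                         ("ordinal", ordinal))):
        for b_level, means in cells.items():
            ax.plot(["A1", "A2"], means, marker="o", label=b_level)
        ax.set_title(title)
        ax.set_xlabel("Levels of A")
    axes[0].set_ylabel("Dependent variable (Y)")
    axes[0].legend()
    plt.show()

Parallel lines indicate no interaction; crossing or converging lines indicate interaction, as the following paragraphs explain.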
We can discuss these graphs briefly since both graphs and graphing relations have been discussed before.* In effect, we ask first if there is a relation between the main effects (independent variables) and the measures of the dependent variables. Each of these relations is plotted as in the preceding chapter, except that the relation between one independent variable and the dependent variable is plotted at both levels of the other
*Extended discussions of interactions can be found in: Edwards, op. cit., chaps. 9-10. A valuable and clear discussion of ordinal and disordinal interactions and the virtue of graphing significant interactions is given in: A. Lubin, "The Interpretation of Significant Interaction," Educational and Psychological Measurement, 21 (1961), 807-817.
independent variable; for instance, A is plotted against the dependent variable (vertical axis) at B1 and B2. The slope of the lines roughly indicates the extent of the relation. In each case, we have chosen to plot the relations using A1 and A2 on the horizontal axis. If the plotted line is horizontal, obviously there is no relation. There is no relation between A and the dependent variable at level B2 in (e) of Figure 14.3, but there is a relation at level B1. In (a), there is a relation between A and the dependent variable at both levels, B1 and B2. The same is true of (c). The nearer the line comes to being diagonal, the higher the relation. If the two lines make approximately the same angle in the same direction (that is, they are parallel), as in (a) and (c), the relation is of approximately the same magnitude at each level. To the extent that the lines make different angles with the horizontal axis (are not parallel), to that extent interaction is present.

If the graphs of Figure 14.3 were plotted from actual research data, we could interpret them as follows. Call the measures of the dependent variable (on the vertical axis) Y. In (a), A is related to Y regardless of B. It makes no difference what B is; A1 and A2 differ at both levels of B. There is no interaction in either (a) or (c). In (d) and (e), however, the case is different. The graph of (d) shows interaction. A is related to Y, but the kind of relation depends on B. Under the B1 condition, A1 is greater than A2. But under the B2 condition, A2 is greater than A1. The graph of (e) says that A is related to Y at level B1 but not at level B2, or A1 is greater than A2 at B1, but at B2 they are equal. (Note that it is possible to plot B on the horizontal axis. The interpretations would differ accordingly.)
NOTES OF CAUTION

Interaction is not always a result of the "true" interaction of experimental treatments. There are, rather, three possible causes of a significant interaction. One is "true" interaction, the variance contributed by the interaction that "really" exists between two variables in their mutual effect on a third variable. Another is error. A significant interaction can happen by chance, just as the means of experimental groups can differ significantly by chance. A third possible cause of interaction is some extraneous, unwanted, uncontrolled effect operating at one level of an experiment but not at another. Such a cause of interaction is particularly to be watched for in nonexperimental uses of the analysis of variance, that is, in the analysis of variance of data gathered after independent variables have already operated. Suppose, for example, that the levels in an experiment on methods were schools. Extraneous factors in such a case can cause a significant interaction. Assume that the principal of one school, although he had consented to having the experiment run in his school, was negative in his attitude toward the research. This attitude could easily be conveyed to teachers and pupils, thus contaminating the experimental treatment, methods. In short, significant interactions must be handled with the same care as any other research results. They are interesting, even dramatic, as we have seen. Thus they can perhaps cause us momentarily to lose our customary caution.9
"A precept that researchers should take seriously is: Whenever possible, replicaie research studies. Repli-
cation should be routinely planned. It is especially necessary when complex relations are found. If an interaction
is found in an original study and it is probably not due to chance, though it could still be due to
in a replication,
other causes. The term "replication" used rather than "repetition" because in a replication, although the
is
original relation is studied again, it might be studied with different kmds of subjects, under somewhat different
conditions, and even with fewer, more, or even different variables. The trend in the psychological research
literature, happily, is to do two or more related studies on the same basic problem. This trend is closely related to
testing alternative hypotheses, whose virtue and necessity were discussed in an earlier chapter. For an excellent
example of replication and multiple studies on the same general problem, see the Jones et al. example cited in
Table 14.11 Example of Disproportion and Unequal Cell n's Arising from Nonexperimental Variables

                Republican    Democrat
    Male            30           20        50
    Female          20           30        50
                    50           50
Two related difficulties of factorial analysis are unequal n's in the cells of a design and the experimental versus nonexperimental use of the method. If the n's in the cells of a factorial design are not equal (and are disproportionate, i.e., not in proportion from row to row), the usual analysis is unsatisfactory. When doing experiments, the problem is not severe because subjects can be assigned to the cells at random (except, of course, for attribute variables) and the n's kept equal or nearly equal. But in the nonexperimental use of factorial analysis, the n's in the cells get pretty much beyond the control of the researcher. Indeed, even in experiments, when more than one categorical variable is included (like race and sex), n's almost necessarily become unequal.
To understand this, take a simple example. Suppose we split a group in two by sex and have, say, 50 males and 50 females. A second variable is political preference, and we want to come up with two equal groups of Republicans and Democrats. But suppose that sex is correlated with political preference. Then there may be, for example, more males who are Republican than females who are Republican, creating a disproportion. This is shown in Table 14.11. Now add another independent variable and the difficulties increase greatly.
What can we do, then, in nonexperimental research? Can't we use factorial analysis of variance? The answer is complex and is evidently not clearly understood. Factorial analysis of variance paradigms can and should be used because they guide and clarify research. There are devices for surmounting the unequal-n difficulty. One can make adjustments of the data, or one can equalize the groups by eliminating subjects at random. These are unwieldy devices, however. The best analytic solution seems to be to use multiple regression analysis. While the problems do not all disappear, many of them cease to be problems in the multiple regression framework. In general, factorial analysis of variance is best suited to experimental research in which the subjects can be randomly assigned to cells, the n's thus kept equal and the assumptions behind the method more or less satisfied. Nonexperimental research, or experimental research that uses a number of nonexperimental (attribute) variables, is better served with multiple regression analysis. With equal n's and experimental variables, multiple regression analysis yields exactly the same sums of squares, mean squares, and F ratios, including interaction F ratios, as the standard factorial analysis. Nonexperimental variables, which are a grave problem for factorial analysis, are routine in multiple regression analysis. We return to all this in a later chapter.
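A minimal sketch of the regression route (the statsmodels library is assumed, and the scores are hypothetical values for a 2 × 2 design with equal cell n's):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # Hypothetical scores: two per cell of a 2 x 2 (A by B) design.
    df = pd.DataFrame({
        "A": ["a1"] * 4 + ["a2"] * 4,
        "B": (["b1"] * 2 + ["b2"] * 2) * 2,
        "y": [8, 6, 7, 7, 4, 2, 3, 3],
    })

    # Coding the factors as regression terms turns the factorial design into
    # a multiple regression model; the ANOVA table recovers the factorial
    # sums of squares, mean squares, and F ratios, interaction included.
    model = ols("y ~ C(A) * C(B)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))

With unequal n's the same machinery still runs, which is the point of the paragraph above; the sums of squares simply stop being independent and must be interpreted accordingly.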
Suppose a factorial analysis has two independent variables, A and B. Both F ratios are statistically significant and the interaction F ratio is not significant. This is straightforward: there is no problem of interpretation. If, on the other hand, A, or B, or both are significant and the interaction of A and B is also significant, there is a problem. Some writers say that the interpretation of significant main effects in the presence of interaction is not possible and, if done, can lead to incorrect conclusions.10 The reason is that when one says that a main effect is significant, one may imply that it is significant under all conditions, that M_A1 is greater than M_A2 with all kinds of individuals and in all kinds of places, for instance. If the interaction between A and B, however, is significant, the conclusion is empirically not valid. One has at least to qualify it: there is at least one condition, namely B, that has to be taken into account. One must say, instead of the simple If p, then q statement: If p, then q, under condition r. Or, for example, M_A1 is greater than M_A2 under condition B1 but not under condition B2. A method of reinforcement, say praise, is effective with middle-class children but not with working-class children.
A general rule is that when an interaction is significant, it may not be appropriate to try to interpret main effects, because the main effects are not constant but vary according to the variables that interact with them. This is especially so if the interaction is disordinal [see Figure 14.3(d)], or if the main effect under study is weak. If a main effect is strong (the differences between means are large) and the interaction is ordinal [see Figure 14.3(e)], then one can perhaps interpret a main effect. Obviously the interpretation of research data when more than one independent variable is studied is often complex and difficult. This is no reason to be discouraged, however. Such complexity only reflects the multivariate and complex nature of psychological, sociological, and educational reality. The task of science is to understand this complexity. Such understanding can never be complete, of course, but substantial progress can be made with the help of modern methods of design and analysis. Factorial designs and analysis of variance are major achievements that substantially enhance our ability to understand complex psychological, sociological, and educational reality.
There is a practical limit, of course, to the number of variables that can be studied at one time. For instance, take an experiment with four independent variables. The smallest arrangement possible is 2 × 2 × 2 × 2, which yields 16 cells, into each of which some minimum number of subjects must be put. If 10 Ss are placed in each cell, it will be necessary to handle a total of 160 Ss in four different ways. Yet one should not be dogmatic about the number of variables. Perhaps in the next ten years factorial designs with more than four variables will become common. Indeed, when we later study multiple regression analysis we will find that factorial analysis of variance can be done with multiple regression analysis and that four and five factors are easily accommodated analytically.
10. For discussions of interpretation when interactions are significant, see: E. Pedhazur, Multiple Regression in Behavioral Research, 2d ed. New York: Holt, Rinehart and Winston, 1982, chap. 10. Pedhazur's discussion is particularly cogent when he attacks the difficulty of interpreting interactions in nonexperimental research.
[Table: Appeals, A1 (Religious) and A2 (Democratic), by Manner of appeal]
Significant first-order interactions are reported more and more in published research studies. Some years ago they were considered to be rare phenomena. This is quite evidently not so.11 Indeed, it is now apparent that interactions of variables are hypothesized on the basis of theory.12 Part of the essence of scientific theory, of course, is specifying the conditions under which a phenomenon can and will occur. For example, years ago Berkowitz was interested in the phenomenon of displacement of aggression.13 When one is frustrated, one may have an aggressive reaction, says frustration-aggression theory. But one cannot always aggress against the source of the frustration. So aggressive urges may be displaced. In Berkowitz's study, it was displaced against Jews: he found a most interesting interaction between hostility arousal and anti-Semitism. Anti-Semitic subjects were more likely to respond to frustration with displaced aggression than less prejudiced subjects. Significant higher-order interactions, while not common, do occur. The trouble is: They are often hard to interpret. First- and second-order interactions can be handled, but third- and higher-order interactions make research life uncomfortable because one is at a loss as to what they mean.14

By now the reader no doubt realizes that in principle the breakdowns of the independent variables are not restricted to just two or three subpartitions. It is quite possible to have 2 × 4, 2 × 5, 4 × 6, 2 × 3 × 3, 2 × 5 × 4, even 4 × 4 × 3 × 5 (Laughlin et al.15) designs.
"See D. Berliner and L. Cahen. 'Trait-Treatment Interaction and Learning," in F. Kerlinger, ed.. Review
of Research in Educauon. Vol. 1. Itasca, 111.: Peacock. 1973. chap. 3; G. Bracht, "Experimental Factors
Related to Aptitude-Treatment Interactions." Review of Educational Research. 40 (1970), 627-645; L. Cron-
bach and R. Snow. Aptitudes and Instructional Methods: A Handbook for Research on Interaction. New York;
Irvington, 1977. Most of the methodological and substantive preoccupation with interaction in the literature is
in education. It even has a name; ATI (Aptitude-Treatment Interaction) research. Evidently it has flourished
because much or most educational research is preoccupied with improving instruction, and interactions of
pupils' aptitudes and instructional methods are believed to be an important key to doing so.
'-For example, F. Bishop, "The Anal Character; A Rebel in the Dissonance Family," Journal of Personal-
ity and Social Psychology. 6 (1967), 23-36; E. Jones, "Conformity as a Tactic of Ingratiation," Science, 149
(i965), 144-150; C. Ames. R. Ames, and D. Felder, "Effects of Competitive Reward Structure and Valence of
Outcome on Children's Achievement Attnbutions." Journal of Educational Psychology. 69 (1977), 1-8; J.
Cooper and C. Scalise, "Dissonance Produced by Deviations from Life Styles: The Interaction of Jungian
Typology and Conformity," Journal of Personality and Social Psychology. 29 (1974), 566-571.
'•'L. Berkowitz, "Anti-Semitism and the Displacement of Aggression," Journal of Abnormal and Social
of Social Class," Journal of Educational Psychology. 59 1968), 102- 10 (the expenmental treatments worked
( 1
differently with subjects of higher and lower socioeconomic status and higher and lower intelligence).
"P. Laughlin et al., "Concept Attainment by Individuals versus Cooperative Pairs as a Function of Mem-
ory, Sex, and Concept Rule," Journal of Personality and Social Psychology, 8 (1968). 410-417.
In educational research, for example, we can study the effects of both methods and, say, kinds of reinforcement. In psychological research, we can study the separate and combined effects of many kinds of independent variables, such as anxiety, guilt, reinforcement, prototypes, types of persuasion, race, and group atmosphere, on many kinds of dependent variables, such as compliance, conformity, learning, transfer, discrimination, perception, and attitude change. In addition, we can control variables such as sex, social class, and home environment.

A second advantage is that factorial analysis is more precise than one-way analysis. Here we see one of the virtues of combining research design and statistical considerations. It can be said that, other things equal, factorial designs are "better" than one-way designs. This value judgment has been implicit in most of the preceding discussion. The precision argument adds weight to it and will be elaborated shortly.
A third advantage, and from a large scientific viewpoint perhaps the most important one, is the study of the interactive effects of independent variables on dependent variables. This has been discussed. But a highly important point must be added. Factorial analysis enables the researcher to hypothesize interactions, because the interactive effects can be directly tested. If we go back to conditional statements and their qualification, we see the core of the importance of this statement. In a one-way analysis, we simply say: If p, then q; if such-and-such methods, then so-and-so outcomes. In factorial analysis, however, we utter richer conditional statements. We can say: If p, then q, and If r, then q, which is tantamount to talking about the main effects in a factorial analysis. In the problem of Table 14.4, for instance, p is methods (A) and r is types of motivation (B). We can also say, however: If p and r, then q, which is equivalent to the interaction of methods and types of motivation. Interaction can also be expressed by: If p, then q, under condition r.
On the basis of theory, previous research, or hunch, researchers can hypothesize interactions. One hypothesizes that an independent variable will have a certain effect only in the presence of another independent variable. Berkowitz, in the study of anti-Semitism and displaced aggression cited earlier, asked whether prejudiced persons were more likely to respond to frustration with displaced aggression than less prejudiced persons. As we saw, this is an interaction hypothesis. Part of his results are given in Table 14.13. The means in the table reflect liking for the partner, which Berkowitz thought would be affected by the hostility arousal and anti-Semitism. Neither main effect, hostility nor anti-Semitism, was statistically significant, but the interaction between them was significant. When hostility is aroused, evidently, high anti-Semitic subjects responded with more displaced aggression than low anti-Semitic subjects. The interaction hypothesis was supported, a finding of both theoretical and practical significance.16
"It has become common practice to partition a continuous variable into dichotomies or other polytomies. In
the Berkowitz study, for instance, a continuous measure, anti-Semitism, was dichotomized. It was pointed out
earlier that creating a categorical variable out of a continuous variable throws variance away and thus should be
avoided. We will learn in a later chapter that factorial analysis of variance can be done with multiple regression
analysis and that, with such analysis, it is not necessary to sacrifice any variance by conversion of variables.
Nevertheless, there are countervailing arguments. One, if a difference is statistically significant and the relation
is substantial, the variable conversion does not matter. The danger is in concealing a relation that in fact exists.
Two. there are times when conversion of a variable may be wise — for example, for exploration of a new field or
problem and when measurement of a variable is at best rough and crude. In other words, while the rule is a good
one, it is best not to be inflexible about it. Much good, even excellent, research has been done with continuous
variables that have been partitioned for one reason or another.
Table 14.13 Part of Berkowitz's Results: Mean Liking-for-Partner Scores*

                            Hostility Arousal    No Hostility Arousal
    High Anti-Semitism            18.4                  14.2
    Low Anti-Semitism             12.2                  16.3

*The higher the score the less the liking for partner.
variation. We now look more closely at the latter. When subjects have been assigned to the experimental groups at random, the only possible estimate of chance variation is the within-groups variance. But, and this is important, it is clear that the within-groups variance contains not only variance due to error; it also contains variance due to individual differences among the subjects. Two simple examples are intelligence and sex. There are, of course, many others. If both girls and boys are used in an experiment, randomization can be used in order to balance the individual differences that are concomitant to sex. Then the number of girls and boys in each experimental group will be approximately equal. We can also arbitrarily assign girls and boys in equal numbers to the groups. This method, however, does not accomplish the overall purpose of randomization, which is to equalize the groups on all possible variables. It does equalize the groups as far as the sex variable is concerned, but we can have no assurance that other variables are equally distributed among the groups. Similarly for intelligence. Randomization, if successful, will equalize the groups such that the intelligence test means and standard deviations of the groups will be approximately equal. Here, again, it is possible arbitrarily to assign youngsters to the groups in a way to make the groups approximately equal, but then there is no assurance that other possible variables are similarly controlled, since randomization has been interfered with.
Now, let us assume that randomization has been "successful." Then theoretically there will be no differences between the groups in intelligence and all other variables. But there will still be individual differences in intelligence, and other variables, within each group. With two groups, for instance, Group 1 might have intelligence scores ranging from, say, 88 to 145, and Group 2 might have intelligence scores ranging from 90 to 142. This range of scores, in and of itself, shows, just as the presence of boys and girls within the groups shows, that there are individual differences in intelligence within the groups. If this be so, then how can we say that the within-groups variance can be an estimate of error, of chance variation?

The answer is that it is the best we can do under the design circumstances. If the design is of the simple one-way kind, there is no other measure of error obtainable. So we calculate the within-groups variance and treat it as though it were a "true" measure of error variance. It should be clear that the within-groups variance will be larger than the "true" error variance, since it contains variance due to individual differences as well as error variance. Therefore, an F ratio may not be significant when in fact there is "really" a difference between the groups. Obviously, if the F ratio is significant, there is not so much to worry about, since the between-groups variance is sufficiently large to overcome the too-high estimate of error variance.
To summarize what has been said, let us rewrite an earlier theoretical equation. The earlier equation was

    V_t = V_b + V_w                         (14.3)

Since the within-groups variance contains more than error variance (the variance due to individual differences, in fact), we can write

    V_w = V_i + V_e                         (14.4)

where V_i = variance due to individual differences and V_e = "true" error variance. If this be so, then we can substitute the right-hand side of Equation 14.4 for the V_w in Equation 14.3:

    V_t = V_b + V_i + V_e                   (14.5)

In other words, Equation 14.5 is a shorthand way to say what we have been saying above. The practical research significance of Equation 14.5 is considerable. If we can find a way to control or measure V_i, to separate it from V_w, then it follows that a more accurate measure of the "true" error variance is possible. Put differently, our ignorance of the variance situation is decreased because we identify and isolate more systematic variance. A portion of the variance that was attributed to error is identified. Consequently the within-groups variance is reduced.
Many of the principles and much of the practice of research design are occupied with this problem, which is essentially a problem of control, the control of variance. When it was said earlier that factorial analysis of variance was more precise than simple one-way analysis of variance, we meant that, by setting up levels of an independent variable, say sex or social class, we decrease the estimate of error, the within-groups variance, and thus get closer to the "true" error variance. Instead of writing Equation 14.5, let us now write a more specific equation, substituting for V_i, the variance of individual differences, V_sc, the variance due to social class, and reintroducing V_e:

    V_t = V_b + V_sc + V_e

Compare this equation to Equation 14.3. More of the total variance, other than the between-groups variance, has been identified and labeled. This variance, V_sc, has in effect been taken out of the V_w of Equation 14.3.
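The gain in precision can be seen in a small simulation (mine, not the text's; the effect sizes are arbitrary): when the scores contain a systematic sex or social-class component, pooling within-cell variances from the finer breakdown gives a smaller, more accurate error estimate than the one-way within-groups variance.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200

    # Hypothetical model: score = treatment effect + sex effect + random error.
    treatment = rng.integers(0, 2, n)      # two experimental groups
    sex = rng.integers(0, 2, n)            # an individual-difference variable
    y = 2.0 * treatment + 3.0 * sex + rng.normal(0.0, 1.0, n)

    # One-way view: the sex variance stays buried in the within-groups term.
    within_oneway = np.mean([y[treatment == g].var(ddof=1) for g in (0, 1)])

    # Factorial view: error computed within treatment-by-sex cells (a rough
    # pooled estimate) no longer contains the sex variance.
    cells = [y[(treatment == g) & (sex == s)] for g in (0, 1) for s in (0, 1)]
    within_factorial = np.mean([c.var(ddof=1) for c in cells])

    print(within_oneway, within_factorial)  # the second should be much smaller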
RESEARCH EXAMPLES

A large number of interesting uses of factorial analysis of variance have been reported in recent years in the behavioral research literature. Indeed, one is confronted with an embarrassment of riches. A number of examples of different kinds have been selected to illustrate further the usefulness and strength of the method. We include more examples than usual because of the complexity of factorial analysis, its frequency of use, and its manifest importance.
In an ingenious and elegantly conceived study, Walster, Cleary, and Clifford asked whether colleges in the United States discriminate against women and black applicants.17 They used a 2 × 2 × 3 factorial design in which race (white, black), sex (male, female), and ability (high, medium, low) were the independent variables and admission (scored on a five-point scale, with 1 = rejection through 5 = acceptance with encouragement) was the dependent variable. They randomly selected 240 colleges from a standard guide and
17. E. Walster, T. Cleary, and M. Clifford, "The Effect of Race and Sex on College Admissions," Journal of Educational Sociology, 44 (1970), 237-244.
Table 14.14 Results of Walster, Cleary, and Clifford Study for Sex, Ability, and Admission (Means)
[Table: Tone by Group (Experimental, Control). The conditioned stimulus was the 300 cy/sec tone. Interaction F = 68.4, highly significant.]
response during sleep. These rather remarkable results do not mean, of course, that complex verbal learning can take place during sleep. But evidently learning of at least a rudimentary kind can. (Note the nice suitability of factorial analysis of variance for the analytic problem and the applicability of the idea of interaction in this situation.)
The above examples were limited to two independent variables. We now look briefly at a more complex example with more than two independent variables. The subject of the research has always been of great interest to educators: reading, scoring, and evaluating student essays. In what is probably an important study of the problem, Freedman manipulated the content, organization, mechanics, and sentence structure of essays. She rewrote eight student essays "of moderate quality" to be either stronger or weaker in the four characteristics just mentioned. (This was a difficult task, which Freedman did admirably.) The essays to be judged included both the original essays and the rewritten essays. The essays were then evaluated by twelve readers (another variable in the design). The dependent variable was quality, rated on a four-point scale. We have, then, a 2 × 2 × 2 × 2 × 12 design (the 12 being the twelve readers). The factorial analysis of variance is summarized in Table 14.16.
These results are interesting and potentially important. First, the readers (R) did not differ, which is as it should be. Second, content and organization were both highly significant. (The author talks about "the largest main effect," which could have been better judged by, say, a measure of the strength of the relation.) Mechanics (M) was also significant; Sentence Structure (SS) was not. But the significant O × SS and O × M interactions showed that the strength or weakness of mechanics and sentence structure mattered when essays had strong organization. This study and its essay assessment are certainly on another level of discourse from the more or less intuitive and loose methods that most of us use in judging student writing!
Study Suggestions

1. Here are some varied and interesting psychological or educational studies that have used factorial analysis of variance in one way or another. Read and study two of them and ask yourself: Was factorial analysis the appropriate analysis? That is, might the researchers have used, say, a simpler form of analysis?

Anderson, R., Reynolds, R., Schallert, D., and Goetz, E. "Frameworks for Comprehending Discourse." American Educational Research Journal, 14 (1977), 367-381. 2 × 2; based on cognitive psychological theory.

Jones, E., Rock, L., Shaver, K., Goethals, G., and Ward, L. "Pattern of Performance and Ability Attribution: An Unexpected Primacy Effect." Journal of Personality and Social Psychology, 10 (1968), 317-340. Set of excellent studies on an important psychological phenomenon. See, especially, Experiment V.

Langer, E., and Imber, L. "When Practice Makes Imperfect: Debilitating Effects of Overlearning." Journal of Personality and Social Psychology, 37 (1980), 2014-2024. 3 × 3 and 3 × 2; unusual findings.

Sigall, H., and Landy, D. "Radiating Beauty: Effects of Having a Physically Attractive Partner on Person Perception." Journal of Personality and Social Psychology, 28 (1973), 218-224. 3 × 2; interesting significant interaction.
2. We are interested in testing the relative efficacies of different methods of teaching foreign languages (or any other subject). We believe that foreign language aptitude is possibly an influential variable. How might an experiment be set up to test the efficacies of the methods? Now add a third variable, sex, and lay out the paradigms of both researches. Discuss the logic of each design from the point of view of statistics. What statistical tests of significance would you use? What part do they play in interpreting the results?
3. Write two problems and the hypotheses to go with them, using any three (or four) variables
you wish. Scan the problems and hypotheses in Study Suggestions 2 and 3, Chapter 2, and the
variables given in Chapter 3. Or use any of the variables of this chapter. Write at least one hypothe-
sis that is an interaction hypothesis.
4. From the random numbers of Appendix C draw 40 numbers, 0 through 9, in groups of 10. Consider the four groups as A1B1, A1B2, A2B1, A2B2.
(a) Do a factorial analysis of variance as outlined in the chapter. What should the A, B, and A × B (interaction) F ratios be like?
(b) Add 3 to each of the scores in the group with the highest mean. Which F ratio or ratios should be affected? Why? Do the factorial analysis of variance. Are your expectations fulfilled? (A computational sketch is given after these suggestions.)
5. Some students may wish to expand their reading and study of research design and factorial analysis of variance. Much has been written, and it is hard to recommend books and articles. There are two books, however, that have rich resources and interesting chapters on design itself, statistical problems, assumptions and their testing, and the history of analysis of variance and related methods.

Collier, R., and Hummel, T., eds. Experimental Design and Interpretation. Berkeley, Calif.: McCutchan, 1977. This book was sponsored by the American Educational Research Association.

Kirk, R., ed. Statistical Issues: A Reader for the Behavioral Sciences. Monterey, Calif.: Brooks/Cole, 1972.
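For Study Suggestion 4, a computational sketch in Python (numpy's random digits stand in for Appendix C; for simplicity the constant is added to the A1B1 cell, where the suggestion says to use the group with the highest mean):

    import numpy as np

    rng = np.random.default_rng(1)
    # Four groups of 10 random digits 0-9: axis 0 = A levels, axis 1 = B levels.
    groups = rng.integers(0, 10, (2, 2, 10)).astype(float)

    def factorial_ss(g):
        # Sums of squares for A, B, A x B, and within, for a 2 x 2 design.
        grand = g.mean()
        n = g.shape[2]                                  # scores per cell
        ss_a = 2 * n * ((g.mean(axis=(1, 2)) - grand) ** 2).sum()
        ss_b = 2 * n * ((g.mean(axis=(0, 2)) - grand) ** 2).sum()
        ss_cells = n * ((g.mean(axis=2) - grand) ** 2).sum()
        ss_ab = ss_cells - ss_a - ss_b
        ss_w = ((g - g.mean(axis=2, keepdims=True)) ** 2).sum()
        return ss_a, ss_b, ss_ab, ss_w

    print(factorial_ss(groups))   # pure random digits: chance-level effects
    groups[0, 0] += 3             # add 3 to every score in cell A1B1
    print(factorial_ss(groups))   # A, B, and A x B sums of squares all change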
Chapter 15
Analysis of Variance:
Correlated Groups
Suppose a team of researchers wants to test the effects of marijuana and alcohol on driving.2 They can, of course, set up a one-way design or a factorial design. Instead, the investigators decide to use subjects as their own controls. That is, each subject is to undergo three experimental treatments or conditions: marijuana (A1), alcohol (A2), and control (A3). After each of these treatments, the subjects will operate a driving simulator. The dependent-variable measure is the number of driving errors. A paradigm of the design of the experiment, with a few fictitious scores, is given in Table 15.1. Note that the sums of both columns and rows are given in the table. Note, too, that the design looks like that of one-way analysis of variance, with one exception: the sums of the rows, which are the sums of each subject's scores across the three treatments, are included.
This is quite a different situation from the earlier models, in which subjects were assigned at random to experimental groups. Here all subjects undergo all treatments. Therefore, each subject is his own control, so to speak. More generally, instead of independence we now have dependence or correlation between groups. What does correlation between groups mean? It is not easy to answer this question with a simple statement.1
1. The term "correlated groups" is used because it seems to express the basic and distinctive nature of the kind of analysis of variance discussed in this chapter. Other terms more commonly used are "randomized blocks" and "repeated measures." Neither of these is completely general, however. See A. Edwards, Experimental Design in Psychological Research, 4th ed. New York: Holt, Rinehart and Winston, 1972, chap. 14.
2. The idea for this example came from an actual research study: A. Crancer et al., "Comparison of the Effects of Marijuana and Alcohol on Simulated Driving Performance," Science, 164 (1969), 851-854.
Table 15.1 Design Paradigm and Fictitious Scores

    Subject    A1 (Marijuana)    A2 (Alcohol)    A3 (Control)     Σ
    1               18                27              16          61
    2               24                29              21          74
    3               21                25              20          66
The extension to designs with more variables involves the same basic idea of correlation between groups.
A Fictitious Example

A principal of a school and the members of his staff decided to introduce a program of education in intergroup relations as an addition to the school's curriculum. One of the problems that arose was in the use of motion pictures. Films were shown in the initial phases of the program, but the results were not too encouraging. The staff hypothesized that the failure of the films to have impact might have resulted from their not making any particular effort to bring out the possible applications of the films to intergroup relations. They decided to test the hypothesis that seeing the films and then discussing them would improve the viewers' attitudes toward minority group members more than would just seeing the films.

For a preliminary study the staff randomly selected a group of students from the total
3. In the example that follows, matching is used to show the applicability of correlated-groups analysis to a common research situation, because certain points about correlation and its effect can be conveniently made. Matching as a research device, however, is not in general advocated, for reasons that will be discussed in a later chapter.
really significant as the .01 level. Let us assume that this statement is true; if it is true,
An Explanatory Digression

In the discussion of factorial analysis of variance, we saw that it is possible to identify and control more of the total variance of an experimental situation by setting up levels of one or more variables presumably related to the dependent variable. The setting up of two or three levels of social class, for example, makes it possible to identify the variance in the dependent-variable scores due to social class. Now, simply shift gears a bit. The matching of the present experiment has actually set up ten levels, one for each pair. The members of the first pair had intelligence scores of 130 and 132, say, the members of the second pair 124 and 125, and so on to the tenth pair, the members of which had scores of 89 and 92. Each pair (level) has a different mean. Now, if intelligence is substantially and positively correlated with the dependent variable, then the dependent-variable pairs of scores should reflect the matching on intelligence. That is, the dependent-variable pairs of scores should also be more like each other than they are like other dependent-variable scores. So the matching on intelligence has "introduced" variance between pairs on the dependent variable, or between-rows variance.
Consider another hypothetical example to illustrate what happens when there is correlation between sets of scores. Suppose that an investigator has matched three groups of subjects on intelligence, and that intelligence was perfectly correlated with the dependent variable, achievement of some kind. This is highly unlikely, but let's go along with it to get the idea. The first trio of subjects had intelligence scores of 141, 142, and 140; the second trio 130, 126, and 128; and so on through the fifth trio of 82, 85, and 82. If we check the rank orders in columns of the three sets of scores, they are exactly the same: 141, 130, . . . , 82; 142, 126, . . . , 85; 140, 128, . . . , 82. Since we assume that r = 1.00 between intelligence and achievement, the rank orders of the achievement scores must be the same in the three groups. The assumed achievement test scores are given on the left-hand side of Table 15.4. The rank orders of these fictitious scores, from high to low, are given in parentheses beside each achievement score. Note that the rank orders are the same in the three groups.
Now suppose that the correlation between intelligence and achievement was approximately zero. In such a case, no prediction could be made of the rank orders of the achievement scores, or, to put it another way, the achievement scores would not be matched. To simulate such a condition of zero correlation, I broke up the rank orders of the scores on the left-hand side of Table 15.4 with the help of a table of random numbers. After drawing three sets of numbers 1 through 5, I rearranged the scores in columns according to the random numbers. (Before doing this, all the column rank orders were 1, 2, 3, 4, 5.) The first set of random numbers was 2, 5, 4, 3, 1. The second number of column A1 was put first. I next took the fifth number of A1 and put it second. This process was continued until the former first number became the fifth number. The same procedure was used with the other two groups of numbers, with, of course, different sets of random numbers. The final results are given on the right-hand side of Table 15.4. The means of the rows are also given, as are the ranks of the column scores (in parentheses).
First, study the ranks of the two sets of scores. In the left-hand portion of the table, labeled I, are the correlated scores. Since the ranks are the same in each column, the average correlation between columns is 1.00. The numbers of the set labeled II, which are essentially random, present quite a different picture. The 15 numbers of both sets are exactly the same. So are the numbers in each column (and their means). Only the row numbers, and, of course, the row means, are different. Look at the rank orders of II. No systematic relations can be found between them. The average correlation should be approximately zero, since the numbers were randomly shuffled. Actually it is .11.
Now study the variability of the row means. Note that the variability of the means of I is considerably greater than that of II. If the numbers are random, the expectation for the mean of any row is the general mean. The means of the rows of II hover rather closely around the general mean of 57.80. The range is 63 − 53 = 10. But the means of the rows of I do not hover closely around 57.80; their variability is much greater, as indicated by a range of 73 − 45 = 28. Calculating the variances of these two sets of means (called between-rows variance), we obtain 351.60 for I and 58.27 for II. The variance of I is six times greater than the variance of II. This large difference is a direct effect of the correlation that is present in the scores of I but not in II. It may be said that the between-rows variance is a direct index of individual differences. The reader should pause here and go over this example, especially the examples of Table 15.4, until the effect of correlation on variance is clear.
What is the effect on the estimate of the error variance of correlated scores? Clearly the variance due to the correlation is systematic variance, which must be removed from the total variance if a more accurate estimate of error variance is desired. Otherwise the error-variance estimate will include the variance due to individual differences and the result will thus be too large. In the example of Table 15.4, we know that the shuffling procedure has concealed the systematic variance due to the correlation. By rearranging the scores the possibility of identifying this variance is removed. The variance is still in the scores of II, but it cannot be extracted. To show this, we calculate the variances of the error terms of I and II; that of I is 3.10, that of II, 149.77. By removing from the total variance the variance due to the correlation, it is possible to reduce the error term greatly, with the result that the error variance of I is 48 times smaller than the error variance of II. If there is substantial systematic variance in the sets of measures, then, and it is possible to isolate and identify this variance, it is clearly worthwhile to do so.
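The shuffling demonstration is easy to reproduce. In this Python sketch (the scores are assumed, not those of Table 15.4), shuffling each column independently destroys the matching and, with it, most of the between-rows variance:

    import numpy as np

    rng = np.random.default_rng(2)

    # Set I: five matched trios (hypothetical; columns highly correlated).
    matched = np.array([
        [72.0, 74.0, 73.0],
        [65.0, 63.0, 64.0],
        [58.0, 57.0, 59.0],
        [50.0, 52.0, 51.0],
        [44.0, 45.0, 46.0],
    ])

    # Set II: the same numbers with each column shuffled independently,
    # which simulates near-zero correlation between the columns.
    shuffled = matched.copy()
    for j in range(shuffled.shape[1]):
        rng.shuffle(shuffled[:, j])    # in-place shuffle of one column

    for name, data in (("I (matched)", matched), ("II (shuffled)", shuffled)):
        row_means = data.mean(axis=1)
        print(name, "between-rows variance:", round(row_means.var(ddof=1), 2))

The between-rows variance of set I is large; that of set II should usually be far smaller, because the systematic individual-differences variance can no longer be identified.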
Actual research data will not be as dramatic as the above example. Correlations are almost never 1.00. But they are often greater than .50 or .60. The higher the correlation, the larger the systematic variance that can be extracted from the total variance and the more the error term can be reduced. This principle becomes very important not only in designing research, but also in measurement theory and practice. Sometimes it is possible to build correlation into the scores and then extract the variance due to the resulting correlated scores. For example, we can obtain a "pure" measure of individual differences by using the same subjects on different trials. Obviously a subject's own scores will be more like one another than like the scores of other subjects.
We return to the fictitious research data of Table 15.2 on the effects of films on attitudes toward minority groups. Earlier we calculated a between-columns sum of squares and variance exactly as in one-way analysis of variance. We found that the difference between the means was not significant when this method was used. From the above discussion, we can surmise that if there is correlation between the two sets of scores, then the variance due to the correlation should be removed from the total variance and, of course, from the estimate of the error variance. If the correlation is substantial, this procedure should make quite a difference: the error term should get considerably smaller. The correlation between the sets of scores of A1 and A2 of Table 15.2 is .93. Since this is a high degree of correlation, the error term when properly calculated should be much lower than it was before.
The additional operation required is simple. Just add the scores in each row of Table 15.2 and calculate the between-rows sum of squares and variance. Square the sum of each row and divide the result by the number of scores in the row; for example, in the first row: 8 + 6 = 14; (14)²/2 = 196/2 = 98. Repeat this procedure for each row, add the quotients, and then subtract the correction term C. This yields the between-rows sum of squares. (Since the number of scores in each row is always 2, it is easier, especially with a desk calculator, to add all the squared sums and then divide by 2.)
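The whole breakdown can be written in a few lines of Python. A minimal sketch, one row per subject (the first row, 8 and 6, is the pair worked in the text; the remaining rows are made up):

    import numpy as np

    scores = np.array([
        [8.0, 6.0],
        [7.0, 5.0],
        [6.0, 6.0],
        [5.0, 4.0],
        [4.0, 3.0],
    ])

    N = scores.size
    C = scores.sum() ** 2 / N                       # correction term
    ss_rows = (scores.sum(axis=1) ** 2 / scores.shape[1]).sum() - C
    ss_cols = (scores.sum(axis=0) ** 2 / scores.shape[0]).sum() - C
    ss_total = (scores ** 2).sum() - C
    ss_residual = ss_total - ss_rows - ss_cols      # the reduced error term
    print(ss_rows, ss_cols, ss_residual)

Removing the between-rows (individual-differences) sum of squares from the total is what shrinks the error term in the analysis that follows.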
This between-rows sum of squares is a measure of the variability due to individual differences. Briefly, the total variance has been broken down into two identifiable or systematic variances and one error variance. And this error variance is a more accurate estimate of error, or chance variation of the scores, than that of Table 15.3.

Rather than substitute in the equation, we set up the final analysis of variance table (Table 15.5). The F ratio of the columns is now 20.00/.89 = 22.47, which is significant at the .001 level. In Table 15.3 the F ratio was not significant.
This is quite a difference. Since the between-columns variance is the same, the difference is due to the greatly decreased error term, now .89 when it was 10.56 before. By calculating the rows sum of squares and variance, it has been possible to reduce the error term to about 1/12 of its former magnitude. In this situation, obviously, the former error variance of 10.56 was greatly over-inflated. Returning to the original problem, it is
numbers in these sets are exactly the same; only their arrangements differ. In I, there is
Table 15.8 Removal of Between-columns Variance by Equalizing Column Means and Scores
    Source            df          ss          ms          F
    Total             79       9216.20
Some psychologists have criticized learning theorists and other psychological investigators for using animals in their research. While there can be legitimate criticism of psychological and other behavioral research, criticizing it because animals are used is part of the frustrating but apparently unavoidable irrationality that plagues all human effort. Yet it does have a certain charm and can itself be the object of scientific investigation.9 In any case, one of the reasons for testing similar hypotheses with different species is the same reason we replicate research in different parts of the United States and in other countries: generality. How much more powerful a theory is if it holds up with southerners, northerners, easterners, and westerners, with Germans, Japanese, Israelis, and Americans, and with rats, pigeons, horses, and dogs. Morrow and Smithson's study attempted to extend learning theory to little creatures whose learning one might think to be governed by different laws than the learning of men and rats. They succeeded, to some extent at least.
They trained eight isopods, through water deprivation and subsequent reinforcement for successful performance (wet blotting paper), to make reversals of their "preferences" for one or the other arm of a T maze. When the Ss had reached a specified criterion of correct turns in the maze, the training was reversed; that is, turning in the direction of the other arm of the T maze was reinforced until the criterion was reached. This was done with each isopod for nine reversals. The question is: Did the animals learn to make the reversals sooner as the trials progressed? Such learning should be exhibited by fewer and fewer errors.

Morrow and Smithson analyzed the data with two-way analysis of variance. The mean number of errors of the initial trial and the nine reversal trials consistently got smaller: 27.5, 23.6, 18.6, 14.3, 16.8, 13.9, 11.1, 8.5, 8.6, 8.6. The two-way analysis of variance table is given in Table 15.11.10 The ten means differ significantly, since the F ratio for columns (reversal trials), 4.78, is significant at the .01 level. That there is correlation between the columns, and thus individual differences among the isopods, is shown by the F ratio for rows, 3.15, also significant at the .01 level. It is a piquant note that even little crustaceans are individuals!
Study Suggestions

1. Do two-way analysis of variance of the two sets of fictitious data of Table 15.6. Use the text as an aid. Interpret the results. Now do two-way analysis of variance of the two sets of Table 15.8; do the same for Table 15.9. Lay out the final analysis of variance tables and compare. Think through carefully how the adjustive corrections have affected the original data.
2. Three sociologists were asked to judge the general effectiveness of the administrative offices
9. Bugelski has written an excellent defense of the use of rats in learning research that students of behavioral research should read: B. Bugelski, The Psychology of Learning. New York: Holt, Rinehart and Winston, 1956, pp. 33-44. Another excellent essay on a somewhat broader base is: D. Hebb and W. Thompson, "The Social Significance of Animal Studies," in G. Lindzey and E. Aronson, eds., The Handbook of Social Psychology, 2d ed. Reading, Mass.: Addison-Wesley, 1968, vol. II, pp. 729-774.
10. I did the analysis of variance from the original data given by Morrow and Smithson in their Table 1.
of ten elementary schools in a school district. One of their measures was administrative flexibility. (The higher the score, the greater the flexibility.) The ten ratings on this measure by the three sociologists are given below:

    S1    S2    S3
Nonparametric Analysis
of Variance and
Related Statistics
It is, of course, possible to analyze data and to draw inferences about relations among
variables without statistics. Sometimes, for example, data are so obvious that a statistical
test is not really necessary. If all the scores of an experimental group are greater than (or
less than) those of a control group, then a statistical test is superfluous. It is also possible
to have statistics of a quite different nature than those we have been studying, statistics
that use properties of the data other than the strictly quantitative. We can infer an effect of
X on Y if the scores of an experimental group are mostly of one kind, say high or low, as
contrasted to the scores of a control group. This is because, on the basis of randomization
and chance, we expect about the same numbers of different kinds of scores in both
experimental and control groups. Similarly, if we arrange all the scores of experimental
and control groups in rank order, from high to low, say, then on the basis of chance alone
we expect the sum or average of the ranks in each group to be about the same. If they are
not, if the higher or the lower ranks tend to be clustered in one of the groups, then we infer
that "something" other than chance has operated.
Indeed, there are many ways to approach and analyze data other than comparing
means and variances. But the basic principle is always the same if we continue to work in
a probabilistic world: compare obtained results to chance or theoretical expectations. If,
for example, we administer four treatments to subjects and expect that one of the four will
excel the others, we can compare the mean of the favored group with the average of the
other three groups in an analysis of variance and planned comparisons manner. But sup-
pose our data are highly irregular in one or more ways and we fear for the validity of the
usual tests of significance. What can we do? We can rank order all the observations, for
one thing. If none of the four treatments has any more influence than any other, we expect
the ranks to disperse themselves among the four groups more or less evenly. If treatment
A2, however, has a preponderance of high (or low) ranks, then we conclude that the usual
expectation is upset. Such reasoning is a good part of the basis of so-called nonparametric
and distribution-free statistics. The purpose of this chapter is to introduce such statistics,
but especially nonparametric analysis of variance, and to bring out the essential
similarity of most inference-aiding methods.
The student should be aware that careful study of nonparametric statistics gives depth
of insight into statistics and statistical inference. The insight gained is probably due to the
considerable loosening of thinking that seems to occur when working tangential to the
usual statistical structure. One sees, so to speak, a broader perspective; one can even
invent statistical tests, once the basic ideas are well understood. In short, statistical and
inferential ideas are generalized on the basis of relatively simple fundamental ideas.
Nonparametric statistical tests make few or no assumptions about the form of the distribution
of the sample population or the values of the population parameters. For example, nonparametric
tests do not depend on the assumption of normality of the population scores. The
problem of assumptions is difficult, thorny, and controversial. Some statisticians and
researchers consider the violation of assumptions a serious matter that leads to invalidity
of parametric statistical tests. Others believe that, in general, violation of the assumptions
is not so serious because tests like the F and t tests are robust, which means, roughly, that
they operate well even under assumption violations, provided the violations are not gross and
multiple. Nevertheless, let's examine three important assumptions and the evidence for
believing parametric methods to be robust. We also discuss a fourth assumption, independence
of observations, because of its generality: it applies no matter what kind of statistical test is used.
There is no single name for the statistics we are discussing. The two most appropriate names are "nonparametric
statistics" and "distribution-free statistics." The latter, for instance, means that the statistical tests of
significance make no assumptions about the precise form of the sampled population. See J. Bradley, Distribution-Free
Statistical Tests. Englewood Cliffs, N.J.: Prentice-Hall, 1968, chap. 2. In this book we will use
"nonparametric statistics" to mean those statistical tests of significance not based on so-called classical statistical
theory, which is based largely on the properties of means and variances and the nature of distributions.
"For an excellent general treatment of the problems and an encouraging review of the robustness issue, see
P.Gardner, "Scales and Statistics," Review of Educational Research. 45 (1975), 43-57. For quite a different
and discouraging view, see Bradley, op. cit. chap. 2. or J. Bradley, "Nonparametric Statistics," in R. Kirk,
ed.. Statistical Issues: A Reader for the Behavioral Sciences. Monterey. Calif.: Brooks/Cole, 1972. Sel. 9.1.
Glass and colleagues, in another treatise on the subject, also take a dim view of assumption violations: G. Glass.
P. Peckham. and J. Sanders. "Consequences of Failure to Meet Assumptions Underlying the Fixed Effects
Analysis of Variance and Covariance." Review of Educational Research. 42 1972). 237-288. In sum. Gardner
(
says to go ahead and use parametric statistics, whereas Bradley advocates nonparametric methods. The difficulty
isthat both arguments are compelling and valid! 1 lean toward Gardner's position. If one uses reasonable care in
sampling and analysis and circumspection in interpretation of statistical results, parametric methods are useful,
valuable, and irreplaceable. Nonparametric methods are useful adjuncts in the statistical armamentarium of the
researcher, but they can by no means replace parametric methods.
Assumption of Normality

The best-known assumption behind the use of many parametric statistics is the assumption
of normality. It is assumed in using the t and F tests (and thus the analysis of variance), for
example, that the samples with which we work have been drawn from populations that are
normally distributed. It is said that, if the populations from which samples are drawn are
not normal, then statistical tests that depend on the normality assumption are vitiated. As
a result, the conclusions drawn from sampled observations and their statistics will be in
question. When in doubt about the normality of a population, or when one knows that the
population is not normal, one should use a nonparametric test that does not make the
normality assumption, it is said. Some teachers urge students of education and psychology
to use only nonparametric tests on the questionable ground that most educational and
psychological populations are not normal. The issue is not this simple.
Homogeneity of Variance

Unless there is good evidence that populations are rather seriously nonnormal and that variances are heterogeneous, it is
usually unwise to use a nonparametric statistical test in place of a parametric one. The
reason for this is that parametric tests are almost always more powerful than nonparametric
tests. (The power of a statistical test is the probability that the null hypothesis will be
rejected when it is actually false.) There is one situation, or rather, combination of situations,
that may be dangerous. Boneau found that when there was heterogeneity of variance
and differences in the sample sizes of experimental groups, significance tests were
adversely affected.
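Boneau's finding lends itself to a small Monte Carlo sketch. The code below is not Boneau's procedure; the population variances, sample sizes, and number of trials are assumed purely for illustration. The null hypothesis is true (the population means are equal), yet the nominal .05 t test rejects it too often when the smaller sample comes from the more variable population.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    trials, rejections = 5000, 0
    for _ in range(trials):
        a = rng.normal(0, 1.0, size=25)     # larger n, small variance
        b = rng.normal(0, 4.0, size=5)      # smaller n, large variance
        t, p = stats.ttest_ind(a, b)        # classical equal-variance t test
        if p < .05:
            rejections += 1
    print(rejections / trials)              # noticeably above .05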
Two important studies were done by Norton and by Boneau. Lindquist gives an admirable summary of the
Norton study: E. Lindquist, Design and Analysis of Experiments. Boston: Houghton Mifflin, 1953, pp. 78-86.
Boneau discusses the whole problem of assumptions and reports his own definitive study in a brilliant article: C.
Boneau, "The Effects of Violations of Assumptions Underlying the t Test," Psychological Bulletin, 57 (1960),
49-64. Another useful article by Boneau is C. Boneau, "A Note on Measurement Scales and Statistical Tests,"
American Psychologist, 16 (1961), 260-261. An excellent but more general article is N. Anderson, "Scales and
Statistics: Parametric and Nonparametric," Psychological Bulletin, 58 (1961), 305-316. Additional empirical
demonstrations of the robustness of analysis of variance and the t test are: P. Games and P. Lucas, "Power of
the Analysis of Variance of Independent Groups on Non-Normal and Normally Transformed Data," Educational
and Psychological Measurement, 26 (1966), 311-327; B. Baker, C. Hardyck, and L. Petrinovich, "Weak
Measurements vs. Strong Statistics: An Empirical Critique of S. S. Stevens' Proscriptions on Statistics,"
Educational and Psychological Measurement, 26 (1966), 291-309. It has also been found that tests of the
significance of coefficients of correlation are insensitive to extreme violations of the assumptions of normality
and measurement scale: L. Havlicek and N. Peterson, "Effect of the Violation of Assumptions Upon Significance
Levels of the Pearson r," Psychological Bulletin, 84 (1977), 373-377.
Continuity and Equal Intervals of Measures

A third assumption is that the measures to be analyzed are continuous measures with equal
intervals. As we shall see in a later chapter, this assumption is behind the arithmetic
operations of adding, subtracting, multiplying, and dividing. Parametric tests like the F
and t tests of course depend on this assumption, but many nonparametric tests do not. This
assumption's importance has also been overrated. Anderson has effectively disposed of
it, and Lord has lampooned it in a well-known article on football numbers.
Despite the conclusions of Lindquist, Boneau, Anderson, and others, it is well to bear
these assumptions in mind. It is not wise to use statistical procedures, or, for that matter,
any kind of research procedures, without due respect for the assumptions behind the
procedures. If they are too seriously violated, the conclusions drawn from research data
may be in error. To the reader who has been alarmed by some statistics books, the best
advice probably is: use parametric statistics, as well as the analysis of variance, routinely,
but keep a sharp eye on data for gross departures from normality, homogeneity of variance,
and equality of intervals. Be aware of measurement problems and their relation to
statistical tests, and be familiar with the basic nonparametric statistics so that they can be
used when necessary. Also bear in mind that nonparametric tests are often quick and easy
to use and are excellent for preliminary, if not always definitive, tests.
Independence of Observations
Another assumption that is important in both measurement and statistics is that of independence
of observations, also called statistical independence. We have already studied
statistical independence in Chapter 7, where we examined independence, mutual exclusiveness,
and exhaustiveness of events and their probabilities. (The reader is urged to
review that section of Chapter 7.) We reexamine independence here, however, in the
context of statistics because of the special importance of the principle involved. The
independence assumption applies to both parametric and nonparametric statistics. That is,
one cannot escape its implications by using a different statistical approach that does not
involve the assumption.
The formal definition of statistical independence is: If two events, A1 and A2, are
statistically independent, the probability of their intersection is: p(A1 ∩ A2) = p(A1) · p(A2).
If, for example, a student takes a test of ten items, the probability of getting any
item correct by chance (guessing) is 1/2. If the items and the responses to them are independent,
then the probability of getting, say, items two, three, and seven correct by
chance is: 1/2 × 1/2 × 1/2 = .125. And similarly for all ten items: (1/2)^10 = .001.
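The arithmetic can be checked in a line or two of Python; the only assumption is the text's, that each item has a chance probability of 1/2:

    p = 1 / 2
    print(p ** 3)     # items two, three, and seven: 0.125
    print(p ** 10)    # all ten items: 0.0009765..., or about .001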
It is assumed in research that observations are independent, that making one observation
does not influence the making of another observation. If I am observing the cooperative
behavior of children, and I note that Anna seems to be very cooperative, then I am likely
to violate the independence assumption because I will expect her future behavior to be
cooperative. If, indeed, the expectation operates, then my observations are not independent.
Statistical tests assume independence of the observations that yield the numbers that
F. Lord, "On the Statistical Treatment of Football Numbers," American Psychologist, 8 (1953), 750-751.
Both the Lord and the Anderson articles are in Kirk, op. cit., selections 2.3 and 2.4.
W. Feller, An Introduction to Probability Theory and Its Applications, vol. I. New York: Wiley, 1950,
p. 115.
go into the statistical calculations. If the observations are not independent, arithmetic
operations and statistical tests are vitiated. For example, if item 3 in the ten-item test
really contained the correct answer to item 9, then the responses to the two items will not
be independent. The probability of getting all ten items right by chance is altered. Instead
of .001, the probability is some larger figure. The calculation of means and other statistics
will be contaminated. Violation of this assumption seems to be fairly common, probably
because it is easy to do.
In Chapter 7 we encountered an interesting and subtle example of violation of the
assumption when we reproduced a table (Table 7.3) whose entries were aggressive acts
rather than the numbers of animals who acted aggressively. If we have a crossbreak
tabulation of frequencies and calculate χ² to determine whether the cell entries depart
significantly from chance expectation, the total N must be the total number of units in the
sample, whether the units are individuals or some sort of aggregate (like groups), the units
having been independently observed. The N's of statistical formulas assume that sample
sizes are the numbers of units of the calculation, each unit being independently observed.
If, for example, one has a sample of 16 subjects, then N = 16. Suppose one had
observed varying acts of some of the subjects and entered the frequencies of occurrence of
these acts. Suppose, further, that a total of 54 such acts was observed and 54 was used as
N. This would be a gross violation of the independence of observations assumption. In
short, the entries in frequency tables must be the numbers of independent observations.
One cannot count several occurrences of a kind of event from one person. If N is the
number of persons, then it cannot become the number of occurrences of events of the
persons. This is a subtle and dangerous point. The statistical analyses of a number of
published studies suffer from violation of the principle. I have even seen a factorial
analysis of variance table in which the tabled entries were numbers of occurrences of
certain events and not the true units of analysis, the individuals of the sample. The
difficulty is not so much that violation of independence is immoral. It is a research
delinquency because it can lead to quite erroneous conclusions about the relations among
variables.
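The inflation can be made concrete with a small sketch. The frequencies below are invented: the same proportions are tested once with N = 16 independently observed subjects and once with N = 54 acts counted from those same subjects. The inflated N produces a much larger chi-square and a spuriously small p.

    from scipy.stats import chi2_contingency

    subjects = [[6, 2], [2, 6]]    # 16 independently observed units
    acts = [[20, 7], [7, 20]]      # 54 acts: not independent observations
    for table in (subjects, acts):
        chi2, p, dof, expected = chi2_contingency(table)
        print(round(chi2, 2), round(p, 4))

The proportions in the two tables are identical; only the pretended N differs, yet the second test reaches "significance" while the first does not.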
their means) in the three columns should be approximately equal. On the other hand, if
the rank sums differ considerably, we infer that something other than chance has been at work.
Kendall has ingeniously shown how it is appropriate to add ranks: M. Kendall, Rank Correlation
Methods. London: Griffin, 1948, p. 1.
Data sometimes occur only in rank order, or ordinal measurement. The Kruskal-Wallis test is most useful in such situations.
But it is also useful when data are irregular but amenable to ranking.
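As a brief, hypothetical illustration (the groups and scores are invented), the Kruskal-Wallis H test can be run directly in Python; the routine ranks the pooled scores internally and compares the rank sums of the groups:

    from scipy.stats import kruskal

    a1 = [12, 15, 11, 14, 13]
    a2 = [22, 25, 19, 24, 21]
    a3 = [16, 18, 17, 20, 23]
    h, p = kruskal(a1, a2, a3)
    print(h, p)    # a large H and small p: the rank sums differ beyond chance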
In situations in which subjects are matched or the same subjects are observed more than
once, a form of rank-order analysis of variance, first devised by Friedman, can be used.
of each composite rating. We also focus on the sums of the ranks at the bottom of the
table.
The formula given by Friedman is:

χr² = [12 / (kn(n + 1))] ΣRj² - 3k(n + 1)

where χr² = χ² for ranks; k = number of rankings; n = number of objects being ranked;
Rj = sum of the ranks in column (group) j; and ΣRj² = sum of the squared sums. First
calculate ΣRj², which here is 474.
Now determine k and n. The number of rankings is k, or the number of times that the
rank-order system, whatever it is, is used. Here k = 6. The number of objects being
ranked, n, or the number of ranks, is 3. (Actually, the raters are not being ranked: 3 is the
number of ranks in the rank-order system being used.) Now calculate χr²:

χr² = [12 / (6)(3)(4)](474) - (3)(6)(4) = 79 - 72 = 7

For the rankings of the professors, with k = 3 and n = 6:

χr² = [12 / (3)(6)(7)](787) - (3)(3)(7) = 74.95 - 63 = 11.95
The rankings of the three groups were found to be significantly different at the .05 level. In the case of the
significance of the differences between the professors, the analysis also showed significance.
In general, the methods should agree fairly well.
A useful measure of the agreement among the rankings is Kendall's coefficient of concordance, W:

W = 12S / k²(n³ - n)

S is the sum of the squared deviations of the totals of the n ranks from their mean. S is
a between-groups sum of squares for ranks. It is like ssb. (In fact, if we divide S by k, S/k,
we obtain the between sum of squares we would obtain in a complete analysis of variance
of the ranks.)

S = (5² + 6² + . . . + 15²) - (63)²/6 = 787 - 661.5 = 125.5

Since k = 3 and n = 6,

W = (12)(125.50) / (3²)(6³ - 6) = 1506/1890 = .797, or .80
'"Using another method of analysis of variance based on ranges rather than variances, the results of the
Friedman test are confirmed. This method, called the sludeniized range test, is useful. For details, see E.
Pearson and H. Hartley, eds., Biomerrika Tables for Statisticians. Cambridge: Cambridge University Press,
1954. vol. I, pp. 5 1-54 and 176-179. Ranges are good measures of variation for small samples but not for large
samples. The principle of the studentized range test is similar to that of the Ftest in that a "within-groups range"
is used to evaluate the range of the means of the groups. Another useful method, that of Link and Wallace, is
described in detail in Mosteller and Bush. op. cit. pp. 304-307 Both methods have the advantage that they can
. ,
be used with one-way and two-way analyses. Still another method, which has the unique virtue of testing an
ordered hypothesis of the ranks, is the L test: E. Page, "Ordered Hypotheses for Multiple Treatments: A
Significance Test for Linear Ranks," Journal of the American Statistical Association. 58 (1963), 216-230.
Kendall, op. cit., chap. 6.
If k and n are small, appropriate tables of S can be used. W = .80 is statistically significant
at the .01 level. The relation is high: evidently there is high agreement among the three
groups in their rankings of the professors.
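The chapter's figures can be verified by direct computation. The six column totals below are hypothetical: the original ranks are not reproduced here, so these values were chosen only to be consistent with the text (they sum to 63 and their squares sum to 787; k = 3 rankings, n = 6 professors).

    rank_totals = [5, 6, 7, 14, 15, 16]    # hypothetical, consistent totals
    k, n = 3, 6

    sum_r = sum(rank_totals)                     # 63
    sum_r2 = sum(r * r for r in rank_totals)     # 787

    chi_r2 = 12 / (k * n * (n + 1)) * sum_r2 - 3 * k * (n + 1)
    print(round(chi_r2, 2))                      # 11.95, as in the text

    s = sum_r2 - sum_r ** 2 / n                  # 125.5
    w = 12 * s / (k ** 2 * (n ** 3 - n))
    print(round(w, 3))                           # .797, i.e., W = .80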
A t test of the difference between two means can be made with the following formula:

t̂ = (X̄1 - X̄2) / ½(R1 + R2)

where t̂ = estimated t; R1 = range of group 1; and R2 = range of group 2 (see MB,
p. 324).
Another property of data is what can be called periodicity. If there are different kinds
of events (heads and tails, male and female, religious preference, etc.), and numerical
data from different groups are combined and ranked, then by chance there should not be
long runs of any particular event, like a long run of females in one experimental group.
The runs test is based on this idea.
Still another property of data was discussed in Chapter 11: distribution. The distributions
of different samples can be compared with each other or with a "criterion" distribution
(like the normal distribution) for deviations. The Kolmogorov-Smirnov test (S, pp. 47-52,
127-136) tests goodness of fit of the distributions. It is a useful test, especially for
small samples.
The most ubiquitous property of data, perhaps, is rank order. Whenever data can be
ranked, they can be tested against chance expectation. Many, perhaps most, nonparametric
tests are rank-order tests. The Kruskal-Wallis and the Friedman tests are, of course,
both based on rank order. Rank-order coefficients of correlation are extremely useful. W
is one of these. So are the Spearman rank-order coefficient of correlation (S, pp. 202-213)
and Kendall's tau (S, pp. 213-223).
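Both coefficients are readily computed; the two rankings of six objects below are, again, invented for illustration:

    from scipy.stats import spearmanr, kendalltau

    r1 = [1, 2, 3, 4, 5, 6]
    r2 = [2, 1, 3, 5, 4, 6]
    rho, p_rho = spearmanr(r1, r2)      # Spearman rank-order coefficient
    tau, p_tau = kendalltau(r1, r2)     # Kendall's tau
    print(rho, tau)                     # both near 1: the rankings agree closely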
All of these tests rest on the principle emphasized again and again, perhaps a bit tediously: assess obtained results
against chance expectation. There is no magic to nonparametric methods. No divine
benison has been put on them. The same probabilistic principles apply.
Another point made earlier needs repetition and emphasis: most analytic problems of
behavioral research can be adequately handled with parametric methods. The F test, t test,
and other parametric approaches are robust in the sense that they perform well even when
the assumptions behind them are violated, unless, of course, the violations are gross or
multiple. Nonparametric methods, then, are highly useful secondary or complementary
techniques that can often be valuable in behavioral research. Perhaps most important, they
again show the power, flexibility, and wide applicability of the basic precepts of probability
and the phenomenon of randomness enunciated in earlier chapters.
Study Suggestions
1. A teacher interested in studying the effect of workbooks decides to conduct a small experiment
with her class. She randomly divides the class into 3 groups of 7 pupils each, calling these
groups A1, A2, and A3. A1 was taught without any workbooks, A2 was taught with the occasional
use of workbooks at the teacher's direction, and A3 was taught with heavy dependence on workbooks.
At the end of four months, the teacher tested the children in the subject matter. The scores
she obtained were in percentage form, and she thought that it might be questionable to use parametric
analysis of variance. So she used the Kruskal-Wallis method. The data are as follows:
[The table of scores for A1, A2, and A3 is not reproduced here.]
PART SIX
Designs of Research
Chapter 17
Research Design: Purpose and Principles
Research design is the plan and structure of investigation so conceived as to obtain answers
to research questions. The plan is the overall scheme or program of the research. It
includes an outline of what the investigator will do from writing the hypotheses and their
operational implications to the final analysis of data. The structure of research is harder to
explain because "structure" is difficult to define clearly and unambiguously. Since it is
a concept that becomes increasingly important as we continue our study, we here break off
and attempt a definition and a brief explanation. The discourse will necessarily be somewhat
abstract at this point. Later examples, however, will be more concrete. More important,
we will find the concept powerful, useful, even indispensable, especially in our later
study of multivariate analysis, where "structure" is a key concept whose understanding is
essential.
The words "structure," "model," and "paradigm" are troublesome because they are hard to define
clearly and unambiguously. A "paradigm" is a model, an example. Diagrams, graphs, and verbal outlines are
paradigms. We use "paradigm" here rather than "model" because "model" has another important meaning in
science, a meaning we return to in Chapter 36 when we discuss the testing of theory using multivariate procedures
and "models" of aspects of theories.
depends on how the observations and the inference were made. Adequately planned and
executed design helps greatly in permitting us to rely on both our observations and our
inferences.
How does design accomplish this? Research design sets up the framework for study of
the relations among variables. Design tells us, in a sense, what observations to make, how
to make them, and how to analyze the quantitative representations of the observations.
Strictly speaking, design does not "tell" us precisely what to do, but rather "suggests"
the directions of observation-making and analysis. An adequate design "suggests," for
example, how many observations should be made, and which variables are active and
which are attribute. We can then act to manipulate the active variables and to categorize
and measure the attribute variables. A design tells us what type of statistical analysis to
use. Finally, an adequate design outlines possible conclusions to be drawn from the
statistical analysis.
An Example

It has been said that colleges and universities discriminate against women in admissions
and in hiring. Suppose we decide to test the allegation about admissions with an experiment,
as follows. To a random sample of 200 colleges we send applications for admission,
basing the applications on several model cases selected over a range of tested ability, with
all details the same except for sex. Half the applications will be those of men and half of
women. Other things equal, we expect approximately equal numbers of acceptances and
rejections for the male and female applicants.
The idea for the example to be used came from the unusual and ingenious experiment cited earlier:
E. Walster, T. Cleary, and M. Clifford, "The Effect of Race and Sex on College Admission," Sociology of
Education, 44 (1971), 237-244.
[The design figures, including Figure 17.2 (the treatments crossed with sex and ability), are not reproduced here.]
concomitantly is to point out ways to avoid these problems. If design and statistical
analysis are planned simultaneously, the analytical work is usually straightforward and
uncluttered.
A highly useful dividend of design is this: a clear design, like that in Figure 17.2,
suggests the statistical tests that can be made. A simple one-variable randomized design
with two partitions, for example, two treatments, A1 and A2, permits only a statistical test
of the difference between the two statistics yielded by the data. These statistics might be
two means, two medians, two ranges, two variances, two percentages, and so forth. Only
one statistical test is ordinarily possible. With the design of Figure 17.2, however, three
statistical tests are possible: (1) between A1 and A2; (2) among B1, B2, and B3; and (3) the
interaction of A and B. In most investigations, not all the statistical tests are of equal
importance. The important ones, naturally, are those directly related to the research problems
and hypotheses.
In the present case the interaction hypothesis [or (3), above] is the important one,
since the discrimination is supposed to depend on ability level. Colleges may practice
discrimination at different levels of ability. As suggested above, females (A2) may be
accepted more than males (A1) at the higher ability level (B1), whereas they may be
accepted less at the lower ability level (B3).
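What the interaction hypothesis asserts can be put numerically. The acceptance rates below are invented solely for illustration: the male-female difference reverses across ability levels, and the interaction is the difference between the two differences.

    # hypothetical proportions of applications accepted
    rates = {
        ("A1", "B1"): .60, ("A2", "B1"): .75,   # high ability: women favored
        ("A1", "B3"): .55, ("A2", "B3"): .40,   # low ability: women disfavored
    }
    diff_b1 = rates[("A2", "B1")] - rates[("A1", "B1")]   # about +.15
    diff_b3 = rates[("A2", "B3")] - rates[("A1", "B3")]   # about -.15
    print(diff_b1, diff_b3, diff_b1 - diff_b3)            # interaction: about .30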
It should be evident that research design is not static. A knowledge of design can help
us to plan and do better research and can also suggest the testing of hypotheses. Probably
more important, we may be led to realize that the design of a study is not in itself adequate
to the demands we are making of it. What is meant by this somewhat peculiar statement?
Assume that we formulate the interaction hypothesis as outlined above without knowing
anything about factorial design. We set up a design consisting, actually, of two
experiments. In one of these experiments we test A1 against A2 under condition B1. In the
second experiment we test A1 against A2 under condition B2. The paradigm would look
like that of Figure 17.3. (To make matters simpler, we are using only two levels of B, B1
and B3, but relabeling B3 as B2. The design is thus reduced to 2 × 2.)
The important point to note is that no adequate test of the hypothesis is possible with
this design. A1 can be tested against A2 under both B1 and B2 conditions, to be sure. But it
is not possible to know, clearly and unambiguously, whether there is a significant interaction
between A and B. Even if MA2 > MA1 under condition B2 (that is, the mean of A2 is
greater than the mean of A1 at B2), as hypothesized, the design cannot provide a clear
possibility of confirming the hypothesized interaction, since we cannot obtain information
about the differences between A1 and A2 at the two levels of B, B1 and B2. Remember that
an interaction hypothesis implies, in this case, that the difference between A1 and A2 is
different at B1 from what it is at B2.
not so straightforward.
There is a substantial body of belief and research that indicates that college students
learn well under a regime of what has been called mastery learning. Very briefly, "mastery
learning" means a system of pedagogy based on personalized instruction and requiring
students to learn curriculum units to a mastery criterion. If there is substantial research
supporting the efficacy of mastery learning, there is at least one study (and a fine study it
is) whose results indicate that students taught with the approach do no better than students
taught with a conventional approach of lecture, discussion, and recitation. Controversy
enters the picture because mastery learning adherents seem so strongly convinced of
its virtues, while its doubters are almost equally skeptical. Will research decide the matter?
Hardly. But let's see how one might approach a relatively modest study capable of
yielding at least a partial empirical answer.
An educational investigator decides to test the hypothesis that achievement in science
is enhanced more by a mastery learning method (ML) than by a traditional method (T).
We ignore the details of the methods and concentrate on the design of the research. Call
the mastery learning method A1 and the traditional method A2. The investigator knows
that other possible independent variables influence achievement: intelligence, sex, social
class background, previous experience with science, motivation, and so on. He has reason
to believe that the two methods work differently with different kinds of students. They
may work differently, for example, with students of different scholastic aptitudes. The
traditional approach is effective, perhaps, with students of high aptitude, whereas mastery
learning is more effective with students of low aptitude. Call aptitude B: high aptitude is
B1 and low aptitude B2.
What kind of design should be set up? To answer this question it is important to label
the variables and to know clearly what questions are being asked. The variables are:
methods, A1 (mastery learning) and A2 (traditional); aptitude, B1 (high) and B2 (low); and
the dependent variable, achievement in science.
The investigator may also have included other variables in the design, especially variables
potentially influential on achievement: general intelligence, social class, sex, high school
average, for example. He decides, however, that random assignment will take care of
intelligence and other possible influential independent variables. His dependent variable
measure is provided by a standardized science knowledge test.
The problem seems to call for a factorial design. There are two reasons for this choice.
T. Amabile, "Effects of External Evaluation on Artistic Creativity," Journal of Personality and Social
Psychology, 37 (1979), 221-233.
For a review, see J. Block and R. Burns, "Mastery Learning." In L. Shulman, ed., Review of Research in
Education, vol. 4. Itasca, Ill.: Peacock, 1976. The mastery learning approach is more complex than the above
[The figure showing the factorial design, Methods (A1, A2) by Aptitude (B1, high; B2, low), is not reproduced here.]
variances present in the total variance of the dependent variable measures are due to the
manipulation and control of the independent variables and to error. Now, back to our
principle.
The researcher's first task is to maximize the experimental variance, separating
its effect from the total variance of the dependent variable. It is necessary to give the
variance of a relation a chance to show itself, to separate itself, so to speak, from the total
variance, which is a composite of variances due to numerous sources and chance. Remembering
this subprinciple of the maxmincon principle, we can write a research precept:
Design, plan, and conduct research so that the experimental conditions are as different as
possible.
In the present research example, this subprinciple means that the investigator must
take pains to make A1 and A2, the mastery learning and traditional methods, as different as
possible. Next, he must so categorize B1 and B2 that they are different on the aptitude
dimension. This latter problem is essentially one of measurement, as we will see in a later
chapter. In an experiment, the investigator is like a puppeteer making the independent
variable puppets do what he wants. He holds the strings of the A1 and A2 puppets in his
right hand and the strings of the B1 and B2 puppets in his left hand. (We assume there is
no influence of one hand on the other; that is, the hands must be independent.) He makes
the A1 and A2 puppets dance apart, and he makes the B1 and B2 puppets dance apart. He
then watches his audience (the dependent variable) to see and measure the effect of his
manipulations. If he is successful in making A1 and A2 dance apart, and if there is a
relation between A and the dependent variable, the audience reaction (if separating A1
and A2 is funny, for instance) should be laughter. He may even observe that he gets
laughter only when A1 and A2 dance apart and, at the same time, B1 or B2 dance apart
(interaction again).
'"There are. of course, exceptions to this subprinciple. but they are probably rare. An investigator might
want to study the effects of small gradations of, say. motivational 1 centives on the learning of some subject
matter. Here hewould not make his experimental conditions as different as possible. Still, he would have to
make them vary somewhat or there would be no discernible resulting variance in the dependent varialle.
The effect of an extraneous variable like intelligence can be virtually eliminated by using
subjects of only one intelligence level, say intelligence scores within the range of 90 to 110.
If we are studying achievement, and racial membership is a possible contributing factor to
the variance of achievement, it can be eliminated by using only members of one race. The
principle is: To eliminate the effect of a possible influential independent variable on a
dependent variable, choose subjects so that they are as homogeneous as possible on that
independent variable.
This method of controlling unwanted or extraneous variance is very effective. If we
select only one sex for an experiment, then we can be sure that sex cannot be a contributing
independent variable. But then we lose generalization power; for instance, we can say
nothing about the relation under study with girls if we use only boys in the experiment. If
the range of intelligence is restricted, then we can discuss only this restricted range. Is it
possible that the relation, if discovered, is nonexistent or quite different with children of
high intelligence or children of low intelligence? We simply do not know; we can only
surmise or guess.
The second way to control extraneous variance is through randomization. This is the
best way, in the sense that you can have your cake and eat some of it, too. Theoretically,
randomization is the only method of controlling all possible extraneous variables. Another
way to phrase it is: if randomization has been accomplished, then the experimental groups
can be considered statistically equal in all possible ways. This does not mean, of course,
that the groups are equal in all the possible variables. We already know that by chance the
groups can be unequal, but the probability of their being equal is greater, with proper
randomization, than the probability of their not being equal. For this reason control of
extraneous variance by randomization is a powerful method of control. All other methods
leave many possibilities of inequality. If we match for intelligence, we may successfully
achieve statistical equality in intelligence (at least in those aspects of intelligence measured),
but we may suffer from inequality in other significantly influential independent
variables like aptitude, motivation, and social class. A precept that springs from this
equalizing power of randomization, then, is: Whenever it is possible to do so, randomly
assign subjects to experimental groups and conditions, and randomly assign conditions
and other factors to experimental groups.
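The precept is simple to carry out. A minimal sketch, with forty hypothetical subject identification numbers dealt into four groups:

    import random

    subjects = list(range(1, 41))    # 40 hypothetical subject IDs
    random.shuffle(subjects)         # chance, and only chance, decides
    groups = [subjects[i::4] for i in range(4)]
    for g in groups:
        print(sorted(g))             # four randomly composed groups of 10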
The third means of controlling an extraneous variable is to build it right into the design
as an independent variable. For example, assume that sex was to be controlled in the
experiment discussed earlier and it was considered inexpedient or unwise to eliminate it.
One could add a third independent variable, sex, to the design. Unless one were interested
in the actual difference between the sexes on the dependent variable or wanted to study the
interaction between one or two of the other variables and sex, however, it is unlikely that
this form of control would be used. One might want information of the kind just mentioned
and also want to control sex. In such a case, adding it to the design as a
variable might be desirable. The point is that building a variable into an experimental
design "controls" the variable, since it then becomes possible to extract from the total
variance of the dependent variable the variance due to the variable. (In the above case, this
would be the "between-sexes" variance.)
These considerations lead to another principle: An extraneous variable can be controlled
by building it into the research design as an attribute variable, thus achieving
control and yielding additional research information about the effect of the variable on
the dependent variable and about its possible interaction with other independent variables.
The fourth way to control extraneous variance is to match subjects. The control principle
behind matching is the same as that for any other kind of control: the control of variance.
One way to apply it is to split the subjects into levels on the relevant variable, treat the levels as in
a factorial design, and then randomize within each level as described above. Matching is
a special case of this principle. Instead of splitting the subjects into two, three, or four
parts, however, they are split into N/2 parts, N being the number of subjects used; thus the
control of variance is built into the design.
In using the matching method several problems may be encountered. To begin with,
the variable on which the subjects are matched must be substantially related to the dependent
variable or the matching is a waste of time. Even worse, it can be misleading. In
addition, matching has severe limitations. If we try to match, say, on more than two
variables, or even more than one, we lose subjects. It is difficult to find matched subjects
on more than two variables. For instance, if one decides to match on intelligence, sex, and
social class, one may be fairly successful in matching on the first two variables but not in
finding pairs that are fairly equal on all three variables. Add a fourth variable and the
problem becomes difficult, often impossible to solve.
Let us not throw the baby out with the bath, however. When there is a substantial
correlation between the matching variable or variables and the dependent variable (greater
than .50 or .60), then matching reduces the error term and thus increases the precision of an
experiment, a desirable outcome. If the same subjects are used with different experimental
treatments (a so-called repeated-measures or randomized-blocks design), we have powerful
control of variance. How better to match on all possible variables than by matching a subject
with himself? Unfortunately, other negative considerations usually rule out this possibility.
It should be forcefully emphasized that matching of any kind is no substitute for
randomization. If subjects are matched, they should then be assigned to experimental
groups at random. Through a random procedure, like tossing a coin or using odd and even
random numbers, the members of the matched pairs are assigned to experimental and
control groups. If the same subjects undergo all treatments, then the order of the treatments
should be assigned randomly. This adds randomization control to the matching, or
repeated-measures, control.
A principle suggested by this discussion is: When a matching variable is substantially
correlated with the dependent variable, matching as a form of variance control can be
profitable and desirable. Before using matching, however, carefully weigh its advantages
and disadvantages in the particular research situation. Complete randomization or the
analysis of covariance may be better methods of variance control.
Still another form of control, statistical control, was discussed at length in Part Five,
but one or two further remarks are in order here. Statistical methods are, so to speak,
forms of control in the sense that they isolate and quantify variances. But statistical
control is inseparable from other forms of design control. If matching is used, for exam-
ple, an appropriate statistical test must be used, or the matching effect, and thus the
control, will be lost.
Earlier we called variance due to individual differences "systematic variance." But when such variance cannot be, or is not,
identified and controlled, we have to lump it with the error variance. Because many
determinants interact and tend to cancel each other out (or at least we assume that they
do), the error variance has this random characteristic.
Another source of error variance is that associated with what are called errors of
measurement: variation of responses from trial to trial, guessing, momentary inattention,
slight temporary fatigue and lapses of memory, transient emotional states of subjects, and
so on.
Minimizing error variance has two principal aspects: (1) the reduction of errors of
measurement through controlled conditions, and (2) an increase in the reliability of measures.
The more uncontrolled the conditions of an experiment, the more the many determinants
of error variance can operate. This is one of the reasons for carefully setting up
controlled experimental conditions. In studies under field conditions, of course, such
control is difficult; still, constant efforts must be made to lessen the effects of the many
determinants of error variance. This can be done, in part, by specific and clear instructions
to subjects and by excluding from the experimental situation factors that are extraneous to
the research purpose.
To increase the reliability of measures is to reduce the error variance. Pending fuller
discussion later in the book, reliability can be taken to be the accuracy of a set of scores.
To the extent that scores do not fluctuate randomly, to this extent they are reliable.
Imagine a completely unreliable measurement instrument, one that does not allow us to
predict the future performance of individuals at all, one that gives one rank ordering of a
sample of subjects at one time and a completely different rank ordering at another time.
With such an instrument, it would not be possible to identify and extract systematic
variances, since the scores yielded by the instrument would be like the numbers in a table
of random numbers. This is the extreme case. Now imagine differing amounts of reliability
and unreliability in the measures of the dependent variable. The more reliable the
measures, the better we can identify and extract systematic variances, and the smaller the
error variance in relation to the total variance.
Another reason for reducing error variance as much as possible is to give systematic
variance a chance to show itself. We cannot do this if the error variance, and thus the error
term, is too large. If a relation exists, we seek to discover it. One way to discover the
relation is to find significant differences between means. But if the error variance is
relatively large due to uncontrolled errors of measurement, the systematic variance
(earlier called "between" variance) will not have a chance to appear. Thus the relation,
although it exists, will probably not be detected.
The problem of error variance can be put into a neat mathematical nutshell. Remember
the equation:

Vt = Vb + Ve

where Vt is the total variance in a set of measures; Vb is the between-groups variance, the
variance presumably due to the influence of the experimental variables; and Ve is the error
variance (in analysis of variance, the within-groups variance and the residual variance).
Obviously, the larger Ve is, the smaller Vb must be, with a given amount of Vt.
Consider the following equation: F = Vb/Ve. For the numerator of the fraction on the
right to be accurately evaluated for significant departure from chance expectation, the
denominator should be an accurate measure of random error.
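A toy computation (the scores are invented) makes the equation concrete. Here the decomposition is carried out on sums of squares; dividing each term by N would give the corresponding variances, and the equality Vt = Vb + Ve holds either way.

    import numpy as np

    groups = [np.array([4., 5., 6.]),
              np.array([7., 8., 9.]),
              np.array([1., 2., 3.])]
    scores = np.concatenate(groups)
    grand = scores.mean()

    total = ((scores - grand) ** 2).sum()                            # Vt
    between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)  # Vb
    error = sum(((g - g.mean()) ** 2).sum() for g in groups)         # Ve
    print(total, between + error)    # 60.0 and 60.0: Vt = Vb + Ve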
A familiar example may make this clear. Recall that in the discussions of factorial
analysis of variance and the analysis of variance of correlated groups we talked about
variance due to individual differences being present in experimental measures. We said
that, while adequate randomization can effectively equalize experimental groups, there
will be variance in the scores due to individual differences, for instance, differences due to
intelligence, aptitude, and so on. Now, in some situations, these individual differences
can be quite large. If they are, then the error variance and, consequently, the denominator
of the F equation, above, will be "too large" relative to the numerator. The
individual differences will have been randomly scattered among, say, two, three, or four
experimental groups. Still, they are sources of variance and, as such, will inflate the
error variance.
Study Suggestions
1. We have noted that research design has the purposes of obtaining answers to research questions
and controlling variance. Explain in detail what this statement means. How does a research
design control variance? Why should a factorial design control more variance than a one-way
design? How does a design that uses matched subjects or repeated measures of the same subjects
control variance? What is the relation between the research questions and hypotheses and a research
design? In answering these questions, make up a research problem to illustrate what you mean (or
use an example from the text).
2. Sir Ronald Fisher, the inventor of analysis of variance, said, in one of his books:
". . . it should be noted that the null hypothesis is never proved or established, but is possibly
disproved, in the course of experimentation. Every experiment may be said to exist only in
order to give the facts a chance of disproving the null hypothesis."
Whether you agree or disagree with Fisher's statement, what do you think he meant by it? In
framing your answer, think of the maxmincon principle and F tests and t tests.
R. Fisher, The Design of Experiments, 4th ed. New York: Hafner, 1951, p. 16.
Chapter 18
Inadequate Designs and Design Criteria
All man's disciplined creations have form. Architecture, poetry, music, painting, mathematics,
scientific research: all have form. Man puts great stress on the content of his
creations, often not realizing that without strong structure, no matter how rich and how
significant the content, the creations may be weak and sterile.
So it is with scientific research. The scientist needs viable and plastic form with which
to express scientific aims. Without content (without good theory, good hypotheses, good
problems) the design of research is empty. But without form, without structure adequately
conceived and created for the research purpose, little of value can be accomplished.
Indeed, it is no exaggeration to say that many of the failures of behavioral
research have been failures of disciplined and imaginative form.
The principal focus of this chapter is on inadequate research designs. Such designs
have been so common that they must be discussed. More important, the student should be
able to recognize them and understand why they are inadequate. This negative approach
has a virtue: the study of deficiencies forces one to ask why something is deficient, which
in turn centers attention on the criteria used to judge both adequacies and inadequacies. So
the study of inadequate designs leads us to the study of the criteria of research design. We
take the opportunity, too, to describe the symbolic system to be used and to identify an
important distinction between experimental and nonexperimental research.
Experimental and nonexperimental approaches alike study relations among phenomena. Their scientific logic is also the same: to bring empirical
evidence to bear on conditional statements of the form If p, then q.
The ideal of science is the controlled experiment. Except, perhaps, in taxonomic
research (research with the purpose of discovering, classifying, and measuring natural
phenomena and the factors behind such phenomena), the controlled experiment is the
desired model of science. It may be difficult for many students to accept this rather
categorical statement since its logic is not readily apparent. Earlier it was said that the
main goal of science was to discover relations among phenomena. Why, then, assign a
priority to the controlled experiment? Do not other methods of discovering relations exist?
Yes, of course they do. The main reason for the preeminence of the controlled experiment,
however, is that researchers can have more confidence that the relations they study
are the relations they think they are. The reason is not hard to see: they study the relations
under the most carefully controlled conditions of inquiry known. The unique and overwhelmingly
important virtue of experimental inquiry, then, is control. In short, a perfectly
conducted experimental study is more trustworthy than a perfectly conducted nonexperimental
study. Why this is so should become more and more apparent as we advance in our
study.
The dependent variable is Y: Yb is the dependent variable before the manipulation of X,
and Ya the dependent variable after the manipulation of X. With ~X, we borrow the
negation sign of set theory; ~X ("not-X") means that the experimental variable, the
independent variable X, is not manipulated. (Note: Ⓧ is a nonmanipulable variable, and
~X is a manipulable variable that is not manipulated.) The symbol [R] will be used for the
random assignment of subjects to experimental groups and the random assignment of
experimental treatments to experimental groups.
The explanation of ~X, just given, is not quite accurate, because in some cases ~X
can mean a different aspect of the treatment X rather than merely the absence of X. In an
older language, the experimental group was the group that was given the so-called experi-
mental treatment, X, while the control group did not receive it, ~X. For our purposes,
however, ~X will do well enough, especially if we understand the generalized meaning of
"control" discussed below. An experimental group, then, is a group of subjects receiving
some aspect or treatment of X. In testing the frustration-aggression hypothesis, the experi-
mental group is the group whose subjects are systematically frustrated. In contrast, the
control group is one that is given "no" treatment.
In modern multivariate research, it is necessary to expand these notions. They are not
changed basically; they are only expanded. It is quite possible to have more than one
experimental group, as we have seen. Different degrees of manipulation of the independent
variable are not only possible; they are often also desirable or even imperative.
Furthermore, it is possible to have more than one control group, a statement that at first
seems like nonsense. How can one have different degrees of "no" experimental treatment?
Because the notion of control is generalized. When there are more than two
groups, and when any two of them are treated differently, one or more groups serve as
"controls" on the others. Recall that control is always control of variance. With two or
more groups treated differently, variance is engendered by the experimental manipulation.
So the traditional notion of X and ~X, treatment and no treatment, is generalized to
X1, X2, . . . , Xk, different forms or degrees of treatment.
If X is circled, Ⓧ, this means that the investigator "imagines" the manipulation of X,
or he assumes that X occurred and that it is the X of his hypothesis. It may also mean that
X is measured and not manipulated. Actually, we are saying the same thing here in
different ways. The context of the discussion should make the distinction clear. Suppose a
sociologist is studying delinquency and the frustration-aggression hypothesis. He observes
delinquency, Y, and imagines that his delinquent subjects were frustrated in their
earlier years, or Ⓧ. Generally, then, all nonexperimental designs will have Ⓧ. Ⓧ
means an independent variable not under the experimental control of the investigator.
One more point: each design in this chapter will ordinarily have an a and a b form.
The a form will be the experimental form, or that in which X is manipulated. The b form
will be the nonexperimental form, that in which X is not under the control of the investigator,
or Ⓧ. Obviously, ~Ⓧ is also possible.
FAULTY DESIGNS
There are four (or more) inadequate designs of research that have often been used, and
are occasionally still used, in behavioral research. The inadequacies of the designs lead
to poor control of independent variables. We number each such design, give it a name,
sketch its structure, and then discuss it.
Design 18.1
(a) X Y (Experimental)
(b) Ⓧ Y (Nonexperimental)

Design 18.1(a) has been called the "One-Shot Case Study," an apropos name. The
(a) form is experimental, the (b) form nonexperimental. An example of the (a) form: a
school faculty institutes a new curriculum and wishes to evaluate its effects. After one
year, Y, student achievement, is measured. It is concluded, say, that achievement has
improved under the new program. With such a design the conclusion is weak. Design
18.1(b) is the nonexperimental form of the one-group design. Y, the outcome, is studied,
and X is assumed or imagined. An example would be to study delinquency by searching
the past of a group of juvenile delinquents for factors that may have led to antisocial
behavior.
Scientifically, Design 18.1 is worthless. There is virtually no control of other possible
influences on the outcome. As Campbell long ago pointed out, the minimum of useful scientific
information requires at least one formal comparison. The curriculum example requires,
at the least, comparison of the group that experienced the new curriculum with a
group that did not experience it. The presumed effect of the new curriculum, say such-and-such
achievement, might well have been about the same under any kind of curriculum.
The point is not that the new curriculum did or did not have an effect, but that, in
the absence of any formal controlled comparison of the performance of the members of
the "experimental" group with the performance of the members of some other group not
experiencing the new curriculum, little can be said about its effect.
An important distinction should be made. It is not that the method is entirely worth-
less, but that it is scientifically worthless. In everyday life, of course, we depend on such
scientifically questionable evidence; we have to. We act, we say, on the basis of our
experience. We hope that we use our experience rationally. The everyday-thinking para-
digm implied by Design 18.1 is not being criticized. Only when such a paradigm is used
and said or believed to be scientific do difficulties arise. Even in high intellectual pursuits,
the thinking implied by this design is used. Freud's careful observations and brilliant and
creative analysis of neurotic behavior seem to fall into this category. The quarrel is not
with Freud, then, but rather with assertions that his conclusions are "scientifically estab-
lished."
Design 18.2
(a) Yb X Ya (Experimental)
(b) Yb Ⓧ Ya (Nonexperimental)
Design 18.2 is only a small improvement on Design 18.1. The essential characteristic
of this mode of research is that a group is compared with itself. Theoretically, there is no
better choice since all possible independent variables associated with the subjects' charac-
teristics are controlled. The procedure dictated by such a design is as follows. A group is
D. Campbell and J. Stanley, Experimental and Quasi-Experimental Designs for Research. Chicago: Rand
McNally, 1963, p. 6.
D. Campbell, "Factors Relevant to the Validity of Experiments in Social Settings," Psychological Bulletin,
54 (1957), 297-312.
measured on Y (the pretest); X is applied; and the group is measured again (the posttest). The
difference between the pretest and posttest scores is taken to be the effect of X. There are,
however, other possible factors that may have contributed to the change in scores. Campbell
gives an excellent detailed discussion of these factors, only a brief outline of which can be
given here.
First is the possible effect of the measurement procedure: measuring subjects changes
them. Can it be that the post-X measures were influenced not by the manipulation of X but
by increased sensitization due to the pretest? Campbell calls such measures reactive measures,
because they themselves cause the subject to react. Controversial attitudes, for
example, seem to be especially susceptible to such sensitization. Achievement measures,
though probably less reactive, are still affected. Measures involving memory are susceptible.
If you take a test now, you are more likely to remember later things that were
included in the test. In short, observed changes may be due to reactive effects.
Two other important sources of extraneous variance are history and maturation. Between
the Yb and Ya testings, many things can occur other than X. The longer the period of
time, the greater the chance of extraneous variables affecting the subjects and thus the Ya
measures. This is what Campbell calls history. These variables or events are specific to
the particular experimental situation. Maturation, on the other hand, covers events that
are general, not specific to any particular situation. They reflect change or growth in the
organism studied. Mental age increases with time, an increase that can easily affect
achievement, memory, and attitudes. People learn in any given time interval, and the
learning may affect dependent variable measures. This is one of the exasperating difficulties
of research that extends over considerable time periods. The longer the time interval,
the greater the possibility that extraneous, unwanted sources of systematic variance will
influence dependent variable measures.
A statistical phenomenon that has misled researchers is the so-called regression effect.
Test scores change as a statistical fact of life: on retest, on the average, they regress
toward the mean. The regression effect operates because of the imperfect correlation
between the pretest and posttest scores. If r_ab = 1.00, then there is no regression effect; if
r_ab = .00, the effect is at a maximum in the sense that the best prediction of any posttest
score from a pretest score is the mean. With the correlations found in practice, the net effect
is that lower scores on the pretest tend to be higher, and higher scores lower, on the
posttest, when, in fact, no real change has taken place in the dependent variable. Thus, if
low-scoring subjects are used in a study, their scores on the posttest will probably be
higher than on the pretest due to the regression effect. This can deceive the researcher into
believing that the experimental intervention has been effective when it really has not.
Similarly, one may erroneously conclude that an experimental variable has had a depressing
effect on high pretest scorers. Not necessarily so. The higher and lower scores of the
two groups may be due to the regression effect.
How does this work? There are many chance factors at work in any set of scores.⁴ On
³Ibid., pp. 298-300. The first point discussed, the possible interaction effect of the pretest, seems first to have been pointed out by Solomon in an excellent article: R. Solomon, "An Extension of Control Group Design," Psychological Bulletin, 46 (1949), 137-150. Also see S. Stouffer, "Some Observations on Study Design," American Journal of Sociology, 55 (1950), 355-361.
⁴Much of this explanation is due to Anastasi's clear discussion of the regression effect: A. Anastasi, Differential Psychology, 3d ed. New York: Macmillan, 1958, pp. 203-205. For an equally excellent and more complete discussion, see R. Thorndike, The Concepts of Over- and Underachievement. New York: Teachers College Press, 1963, pp. 1-15. The statistically sophisticated student should consult J. Nesselroade, S. Stigler, and P. Baltes, "Regression Toward the Mean and the Study of Change," Psychological Bulletin, 88 (1980), 622-637.
the pretest some high scores are higher than "they should be" due to chance, and similarly
with some low scores. On the posttest it is unlikely that the high scores will be
maintained, because the factors that made them high were chance factors, which are
uncorrelated on the pretest and posttest. Thus the high scorer will tend to drop on the
posttest. A similar argument applies to the low scorer, but in reverse.
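The mechanism is easy to demonstrate by simulation. The sketch below (Python with NumPy; the score distribution, error sizes, and cutoffs are illustrative assumptions, not data from the text) generates pretest and posttest scores that share a stable true component but have uncorrelated chance errors. The highest pretest scorers drop on retest and the lowest rise, although no real change has occurred anywhere.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true = rng.normal(50, 10, n)           # stable "true" scores; no real change occurs
pre = true + rng.normal(0, 5, n)       # pretest = true score + chance error
post = true + rng.normal(0, 5, n)      # posttest error is uncorrelated with pretest error

high = pre >= np.percentile(pre, 90)   # highest pretest scorers
low = pre <= np.percentile(pre, 10)    # lowest pretest scorers
print(pre[high].mean(), post[high].mean())   # posttest mean is lower: regression down
print(pre[low].mean(), post[low].mean())     # posttest mean is higher: regression up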
Research designs have to be constructed with the regression effect in mind. There is
no way in Design 18.2 to control it. If there were a control group, then one could
"control" the regression effect, since both experimental and control groups have pretest
and posttest. If the experimental manipulation has had a "real" effect, then it should be
apparent over and above the regression effect. That is, the scores of both groups, other
things equal, are affected the same by regression and other influences. So if the groups
differ on the posttest, it should be due to the experimental manipulation.
Design 18.2 is inadequate not so much because extraneous variables and the regres-
sion effect can operate (the extraneous variables operate
whenever there is a time interval
between pretest and posttest), but because we do not know whether they have operated,
whether they have affected the dependent-variable measures. The design affords no op-
portunity to control or to test such possible influences.
The peculiar title of this design stems in part from its very nature. Like Design 18.2 it
is a before-after design. Instead of using the before and after (or pretest-posttest) measures
of one group, we use as pretest measures the measures of another group, which are chosen
to be as similar as possible to the experimental group and thus a control group of a sort.
(The line between the two levels, above, indicates separate groups.) This design satisfies
the condition of having a control group and is thus a gesture toward the comparison that is
necessary to scientific investigation. Unfortunately, the controls are weak, a result of our
inability to know that the two groups were equivalent before X, the experimental manipu-
lation.
(a)    X   Y   (Experimental)
      ~X   Y   (Control)
Stouffer says, "there is all too often a wide-open gate through which other uncontrolled ..."

... sex is irrelevant.
Another example of this weakness is the case where three or four experimental groups
are needed (for example, three experimental groups and one control group, or four
groups with different amounts or aspects of X, the experimental treatment) and the
investigator uses only two because he has heard that an experimental group and a control
group are necessary and desirable.
The example discussed in Chapter 17 of testing an interaction hypothesis by perform-
ing, in effect, two separate experiments is another example. The hypothesis to be tested
was that discrimination in college admissions is a function of both sex and ability level,
that it is women of low ability who are excluded (in contrast to men of low ability). This is
an interaction hypothesis and probably calls for a factorial-type design. To set up two
experiments, one for college applicants of high ability and another for applicants of low
ability, is poor practice because such a design, as shown earlier, cannot decisively test the
stated hypothesis. Similarly, to match subjects on ability and then set up a two-group
design would miss the research question entirely. These considerations lead to a general
and seemingly obvious precept:
The second criterion is control, which means control of independent variables: the inde-
pendent variables of the research study and extraneous independent variables. Extraneous
independent variables, of course, are variables that may influence the dependent varia-
ble but that are not part of the study. In the admissions study of Chapter 17, for example,
geographical location (of the colleges) may be a potentially influential extraneous variable
that can cloud the results of the study. If colleges in the east, for example, exclude more
women than colleges in the west, then geographical location is an extraneous source of
variance in the admissions measures — which should somehow be controlled. The crite-
rion also refers to control of the variables of the study. Since this problem has already
been discussed and will continue to be discussed, no more need be said here. But the
question must be asked: Does this design adequately control independent variables?
The best single way to answer this question satisfactorily is expressed in the following
principle:
Randomize whenever possible: select subjects at random; assign subjects to groups at random;
assign experimental treatments to groups at random.
While it may not be possible to select subjects at random, it may be possible to assign
them to groups at random, thus "equalizing" the groups in the statistical sense discussed
in Parts Four and Five. If such random assignment of subjects to groups is not possible,
then every effort should be made to assign experimental treatments to experimental groups
at random. And, if experimental treatments are administered at different times with differ-
ent experimenters, times and experimenters should be assigned at random.
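The two randomizations the principle calls for can be sketched in code. The following Python fragment is a minimal illustration (the subject labels, group count, and helper name are invented for the example): subjects are dealt into groups at random, and then the treatments are assigned to the groups at random.

import random

def assign_at_random(subjects, n_groups):
    # Shuffle the pool, then deal subjects into n_groups groups of (nearly) equal size.
    pool = list(subjects)
    random.shuffle(pool)
    return [pool[i::n_groups] for i in range(n_groups)]

subjects = [f"S{i:02d}" for i in range(1, 41)]     # 40 hypothetical subjects
group_1, group_2 = assign_at_random(subjects, 2)

treatments = ["X", "~X"]                           # experimental and control treatments
random.shuffle(treatments)                         # treatments assigned to groups at random
assignment = {treatments[0]: group_1, treatments[1]: group_2}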
The principle that makes randomization pertinent is complex and difficult to implement:

Control the independent variables so that extraneous and unwanted sources of systematic variance are minimized.
As we have seen earlier, randomization theoretically satisfies this principle (see Chapter
8). When we test the empirical validity of an If p, then q proposition, we manipulate p
and observe that q covaries with the manipulation of p. But how confident can we be that
our If p, then q statement is really "true"? Our confidence is directly related to the
completeness and adequacy of the controls. If we use a design similar to Designs 18.1
through 18.4, we cannot have much confidence in the empirical validity of the If p,
then q statement, since our control of extraneous independent variables is weak or nonexistent.
Because such control is not always possible in much psychological, sociological,
and educational research, should we then give up research entirely? By no means. Nevertheless,
we must be aware of the weaknesses of intrinsically poor design.
Generalizability

Can we generalize the results of a study to other subjects, other groups, and other conditions?
Perhaps the question is better put: How much can we generalize the results of the study?
This is probably the most complex and difficult question that can be asked of research data
because it touches not only on technical matters like sampling and research design, but
also on larger problems of basic and applied research. In basic research, for example,
generalizability is not the first consideration, because the central interest is the relations
among variables and why the variables are related as they are.⁵ This emphasizes the
internal rather than the external aspects of the study. In applied research, on the other
hand, the central interest forces more concern for generalizability because one certainly
wishes to apply the results to other persons and to other situations. If the reader will
ponder the following two examples of basic and applied research, he can get closer to this
distinction.
In Chapter 14 we examined a study by Berkowitz on hostility arousal, anti-Semitism,
and displaced aggression. This is clearly basic research: the central interest was in the
relations among hostility, anti-Semitism, and displaced aggression. While no one would
be foolish enough to say that Berkowitz was not concerned with hostility, anti-Semitism,
and displaced aggression in general, the emphasis was on the relations among the variables
of the study. Contrast this study with the effort of Walster et al. to determine whether
colleges discriminate against women. Naturally, Walster and her colleagues were particu-
lar about the internal aspects of their study. But they perforce had to have another interest:
Two general criteria of research design have been discussed at length by Campbell and by
Campbell and Stanley.⁶ These notions constitute one of the most significant, important,
and enlightening contributions to research methodology in the last two or three decades.
Internal validity asks the question: Did X, the experimental manipulation, really make
a significant difference? The three criteria of the last chapter are actually aspects of
internal validity. Indeed, anything affecting the controls of a design becomes a problem of
internal validity. If a design is such that one can have little or no confidence in the
relations, as shown by significant differences between experimental groups, say, this is a
problem of internal validity.
A difficult criterion to satisfy, external validity means representativeness or generaliz-
ability. When an experiment has been completed and a relation found, to what populations
can it be generalized? Can we say that A is related to B for all school children? All
eighth-grade children? All eighth-grade children in this school system or the eighth-grade
children of this school only? Or must the findings be limited to the eighth-grade children
with whom we worked? This is a very important scientific question that should always be
asked and answered.
Not only must sample generalizability be questioned. It is necessary to ask questions
about the ecological and variable representativeness of studies. If the social setting in
which the experiment was conducted is changed, will the relation of A and B still hold?
Will A be related to B if the study is replicated in a lower-class school? In a western
school? In a southern school? These are questions of ecological representativeness.
Variable representativeness is more subtle. A question not often asked, but that should
be asked, is: Are the variables of this research representative? When an investigator works
with psychological and sociological variables, he assumes that his variables are "con-
stant." If he finds a difference in achievement between boys and girls, he assumes that
sex as a variable is "constant."
⁵For a brief discussion of basic and applied research, see F. Kerlinger, "Research in Education." In R. Ebel, V. Noll, and R. Bauer, eds., Encyclopedia of Educational Research, 4th ed. New York: Macmillan, 1969, pp. 1127-1143, esp. p. 1128. This article also cites a number of references on basic and applied research.
⁶Campbell, op. cit.; Campbell and Stanley, op. cit. Readers are urged to study these sources, since the above discussion can only define and highlight internal and external validity.
In the case of variables like achievement, aggression, aptitude, and anxiety, can the
investigator assume that the "aggression" of his suburban subjects is the same "aggression"
to be found in city slums? Is the variable the same in a European suburb? When we
speak of "anxiety," what kind of anxiety do we mean? Are all kinds of anxiety the same? If anxiety is
manipulated in one situation by verbal instructions and in another situation by electric
shock, are the two induced anxieties the same? If anxiety is manipulated by, say, experimental
instruction, is this the same anxiety as that measured by an anxiety scale? Variable
representativeness, then, is another aspect of the larger problem of external validity, and
thus of generalizability.
Unless special precautions are taken and special efforts made, the results of research
are frequently not representative, and hence not generalizable. Campbell and Stanley say
that internal validity is the sine qua non of research design, but that the ideal design should
be strong in both internal validity and external validity, even though they are frequently
contradictory. This point is well taken. In these chapters, the main emphasis will be on
internal validity, with a vigilant eye on external validity.
The negative approach of this chapter was taken in the belief that an exposure to poor
but commonly used and accepted procedures, together with a discussion of their major
weaknesses, would provide a good starting point for the study of research design. Other
inadequate designs are possible, but all such designs are inadequate on design-structural
principles alone. This point should be emphasized because in the next chapter we will find
that a perfectly good design structure can be poorly used. Thus it is necessary to learn and
understand the two sources of research weakness: intrinsically poor designs and intrinsi-
cally good designs poorly used.
Study Suggestions
1. The faculty of a liberal arts college has decided to begin a new curriculum for all undergrad-
uates. It asks a faculty research group to study the program's effectiveness for two years. The
research group, wanting to have a group with which to compare the new curriculum group, requests
that the present program be continued for two years and that students be allowed to volunteer for the
present or the new program. The research group believes that it will then have an experimental
group and a control group.
Discuss the research group's proposal critically. How much faith would you have in the findings
at the end of two years? Give reasons for positive or negative reactions to the proposal.
2. Imagine that you are a graduate school professor and have been asked to judge the worth of
a proposed doctoral thesis. The doctoral student is a school superintendent who is instituting a new
type of administration into his school system. He plans to study the effects of the new administration
for a three-year period and then write his thesis. He will not study any other school situation during
the period so as not to bias the results, he says.
Discuss the proposal. When doing so, ask yourself: Is the proposal suitable for doctoral work?
3. In your opinion, should all research be held rather strictly to the criterion of generalizability?
If so, why? If not, why not? Which field is likely to have more basic research: psychology or
education? Why? What implications does your conclusion have for generalizability?
4. What does replication of research have to do with generalizability? Explain. If it were
possible, should all research be replicated? If so, why? What does replication have to do with
external and internal validity?
Chapter 19

General Designs of Research
The ordered pairs, then, are: A₁B₁, A₁B₂, A₂B₁, A₂B₂. Since we have a set of ordered
pairs, this is a relation. It is also a cross partition. The reader should look back at Figures
4.7 and 4.8 of Chapter 4 to help clarify these ideas, and to see the application of the
Cartesian product and relation ideas to research design. For instance, A₁ and A₂ can be
two aspects of any independent variable: experimental-control, two methods, male and
female, and so on.
A design is some subset of the Cartesian product of the independent variables and the
dependent variable. It is possible to pair each dependent variable measure, which we call
Y in this discussion, with some aspect or partition of an independent variable. The simplest
possible cases occur with one independent variable and one dependent variable. In
Chapter 10, an independent variable, A, and a dependent variable, B, were partitioned
into [A₁, A₂] and [B₁, B₂] and then cross-partitioned to form the by-now familiar 2 × 2
crossbreak, with frequencies or percentages in the cells. We concentrate, however, on
similar cross partitions of A and B, but with continuous measures in the cells.
Take A alone, using a one-way analysis of variance design. Suppose we have three
experimental treatments, A₁, A₂, and A₃, and, for simplicity, two Y scores in each cell.
This is shown on the left of Figure 19.1, labeled (a). Say that six subjects have been
assigned at random to the three treatments, and that the scores of the six individuals after
the experimental treatments are those given in the figure.
The right side of Figure 19.1, labeled (b), shows the same idea in ordered-pair or
relation form. The ordered pairs are A₁Y₁, A₁Y₂, A₂Y₃, A₂Y₄, A₃Y₅, A₃Y₆. This is, of course, not
a Cartesian product, which would pair A₁ with all the Y's, A₂ with all the Y's, and A₃ with
all the Y's, a total of 3 × 6 = 18 pairs. Rather, Figure 19.1(b) is a subset of the Cartesian
product, A × B. Research designs are subsets of A × B, and the design and the research
problem define or specify how the subsets are set up. The subsets of the design of Figure
19.1 are presumably dictated by the research problem.
(a)
A₁: 7, 9    A₂: 7, 5    A₃: 3, 3

(b) The same scores written as ordered pairs: (A₁, 7), (A₁, 9), (A₂, 7), (A₂, 5), (A₃, 3), (A₃, 3).

Figure 19.1
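The distinction between the full Cartesian product and the design subset can be made concrete in a few lines of code. This Python sketch uses the treatment labels and scores of Figure 19.1; everything else (variable names, the assertion) is illustrative.

import itertools

treatments = ["A1", "A2", "A3"]
scores = [7, 9, 7, 5, 3, 3]                      # Y1 through Y6, as in Figure 19.1

product = list(itertools.product(treatments, scores))
print(len(product))                              # 3 x 6 = 18 pairs

# The design pairs each Y only with the treatment its subject actually received.
design = [("A1", 7), ("A1", 9), ("A2", 7), ("A2", 5), ("A3", 3), ("A3", 3)]
assert set(design) <= set(product)               # the design is a subset of the product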
Figure 19.2  (a) An A × B analytic paradigm.
partition of X. But X can be partitioned into a number of X's, perhaps changing the design
from a simple one-variable design to, say, a factorial design. The basic symbolism associated
with Design 19.1, however, remains the same. These complexities will, we hope, be
clarified in this and succeeding chapters.
Modern research designs, especially factorial designs, were born when analysis of variance was invented. Although
there is no hard law that says that analysis of variance is applicable only in experimental
situations (indeed, it has been used many times in nonexperimental research), it is in
general true that it is most appropriate for the data of experiments.³ This is especially so
for factorial designs where there are equal numbers of cases in the design paradigm cells,
and where the subjects are assigned to the experimental conditions (or cells) at random.
When it is not possible to assign subjects at random, and when, for one reason or
another, there are unequal numbers of cases in the cells of a factorial design, the use of
analysis of variance is questionable, even inappropriate. It can also be clumsy and inele-
gant. This is because the use of analysis of variance assumes that the correlations between
or among the independent variables of a factorial design are zero. Random assignment
makes this assumption tenable since such assignment presumably apportions sources of
variance equally among the cells. But random assignment can only be accomplished in
experiments. In nonexperimental research, the independent variables are more or less
fixed characteristics of the subjects, e.g., intelligence, sex, social class, and the like.
They are usually systematically correlated. Take two independent manipulated variables,
say reinforcement and anxiety. Because subjects with varying amounts of characteristics
correlated with these variables are randomly distributed in the cells, the correlations
between aspects of reinforcement and anxiety are assumed to be zero. If, on the other
hand, the two independent variables are intelligence and social class, both ordinarily
nonmanipulable and correlated, the assumption of zero correlation between them neces-
sary for analysis of variance cannot be made. Some method of analysis that takes account
of the correlation between them should be used. We will see later in the book that such a
method is readily available: multiple regression.
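The point about cell frequencies can be illustrated numerically. In the sketch below (Python with NumPy; the cell counts are invented for illustration), two dichotomous independent variables are coded 0 and 1. With equal cell n's their correlation is zero, as analysis of variance assumes; with disproportionate cell n's it is far from zero.

import numpy as np

def factor_correlation(cell_ns):
    # cell_ns[i][j] = number of subjects in cell (A_i, B_j) of a 2 x 2 design.
    a, b = [], []
    for i, row in enumerate(cell_ns):
        for j, n in enumerate(row):
            a += [i] * n
            b += [j] * n
    return np.corrcoef(a, b)[0, 1]

print(factor_correlation([[10, 10], [10, 10]]))   # equal n's: r = 0.0
print(factor_correlation([[20, 5], [5, 20]]))     # disproportionate n's: r = 0.6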
We have not yet reached a state of research maturity sufficient to appreciate the profound
difference between the two situations. For now, however, let us accept the difference and
the statement that analysis of variance is basically an experimental conception and form of
analysis. Strictly speaking, if our independent variables are nonexperimental, then analysis
of variance is not the appropriate mode of analysis.⁴ Similarly, if for some reason the
³See W. Hays, Statistics, 3d ed. New York: Holt, Rinehart and Winston, 1981. Hays virtually equates experimental design models and analysis of variance models (pp. 328-329).
⁴There are exceptions to this statement. For instance, if one independent variable is experimental and one nonexperimental, analysis of variance is appropriate. In one-way analysis of variance, moreover, since there is only one independent variable, analysis of variance can be used with a nonexperimental independent variable, though regression analysis would probably be more appropriate. In Study Suggestion 5 at the end of the chapter, an interesting use of analysis of variance with nonexperimental data is cited.
numbers of cases in the cells are unequal (and disproportionate), then there will be correlation
between the independent variables, and the assumption of zero correlation is not
tenable. This rather abstract and abstruse digression from our main design theme may
seem a bit confusing at this stage of our study. The problems involved should become
clear after we have studied experimental and nonexperimental research and, later in the
book, that fascinating and powerful approach known as multiple regression.
THE DESIGNS
In the remainder of this chapter we discuss four or five basic designs of research. Remem-
ber that a design is a plan, an outline for conceptualizing the structure of the relations
among the variables of a research study. A design not only lays out the relations of the
study; it also implies how the research situation is controlled and how the data are to be
analyzed. A design, in the sense of this chapter, is the skeleton on which we put the
variable-and-relation flesh of our research. The sketches given in Designs 19.1 through
19.8, following, are designs, the bare and abstract structure of the research. Sometimes
analytic tables, such as Figure 19.2 (on the left) and the figures of Chapter 17 (e.g.,
Figures 17.2, 17.3, and 17.5) and elsewhere, are called designs. While no great harm is
done by calling them designs, they are, strictly speaking, analytic paradigms. We will not
be fussy, however. We will call both kinds of representations "designs."
Design 19.1

[R]    X   Y   (Experimental)
      ~X   Y   (Control)
Design 19.1, with two groups as above, and its variants with more than two groups,
are probably the "best" designs for many experimental purposes in behavioral research.
The [R] before the paradigm indicates that subjects are randomly assigned to the experimental
group (top line) and the control group (bottom line). This randomization removes
the objections to Design 18.4 mentioned in Chapter 18. Theoretically, all possible independent
variables are controlled. Practically, of course, this may not be so. If enough
subjects are included in the experiment to give the randomization a chance to "operate,"
then we have strong control, and the claims of internal validity are rather well satisfied.
If extended to more than two groups and if it is capable of answering the research
questions asked, Design 19.1 has the following advantages: (1) it has the best built-in
theoretical control system of any design, with one or two possible exceptions in special
cases; (2) it is flexible, being theoretically capable of extension to any number of groups
with any number of variables; (3) if extended to more than one variable, it can test several
hypotheses at one time; and (4) it is statistically and structurally elegant.
Before taking up other designs, we need to examine the notion of the control group,
one of the creative inventions of the last hundred years, and certain extensions of Design
19.1. The two topics go nicely together.
⁵E. Boring, "The Nature and History of Experimental Control," American Journal of Psychology, 67 (1954), 573-589.
Design 19.2

[R]   X₁   Y
      X₂   Y
      X₃   Y
⁶R. Solomon, "An Extension of Control Group Design," Psychological Bulletin, 46 (1949), 137-150. Perhaps the notion of the control group was used in other fields, though it is doubtful that the idea was well developed. Solomon (p. 175) also says that the Peterson and Thurstone study of attitudes in 1933 was the first serious attempt to use control groups in the evaluation of the effects of educational procedures. One cannot find the expression "control group" in the famous eleventh edition (1911) of the Encyclopaedia Britannica, even though experimental method is discussed.
⁷E. Thorndike and R. Woodworth, "The Influence of Improvement in One Mental Function upon the Efficiency of Other Functions," Psychological Review, 8 (1901), 247-261, 384-395, 553-564.
⁸E. Thorndike, "Mental Discipline in High School Subjects," Journal of Educational Psychology, 15 (1924), 1-22, 83-98.
⁹See ibid., pp. 93, 97, and 85 for the three points mentioned.
Figure 19.4  Three experimental groups with five matched subjects (or trials) each; the cells contain the Y measures.
on more than one trial, because a subject is naturally more like himself than he is like
other persons. The three experimental groups of Figure 19.4 can be the same subjects on
different trials, which introduces systematic variance, individual differences variance.
When pretests and posttests are used, matching is of course present, too. Schools are
known to differ in important characteristics: classes differ, school districts differ, neighborhoods
differ, teachers differ. These differences can be used in the study, and the
variances arising from their use can be isolated by building their sources into the design.
In fact, failure to build such variables into designs can lead to the confounding of the
experimental variables.¹⁰ And to have correlation between groups, due to matching or
repeated measures of individuals or units like classes and schools, and not to take advantage
of the correlation is a statistical and design blunder.¹¹
Design 19.3

(a)  [R]   Yb    X    Ya   (Experimental)
           Yb   ~X    Ya   (Control)

(b)  [Mr]  Yb    X    Ya   (Experimental)
           Yb   ~X    Ya   (Control)
'"The terms "confounding" means the "mixing" of the variance of one or more independent variables,
usually extraneous to the research purpose, with the independent variable or variables of the research problem.
As a result it cannot be clearly said that the relation found is between the independent variables and the
dependent variable of the research. Underwood points out that there is only one basic principle of research
design: "design the expenment so that the effects of the independent variables can be evaluated unambigu-
ously." (B. Underwood, Psychological Research. New York: Appleton, 1957, p. 86.) When this cannot be
— —
done it is a difficult procedure more likely than not the independent variables have been confounded. The
term evidently came from statistics, where "confounding" is sometimes deliberately practiced.
"It must be emphasized that matching is in general not a desirable procedure. And matching is never a
substitute for randomization. Remember, too. that the correlation between the matching variable or variables
and the dependent variable must be substantial (greater than .50, say) to be productive, and that only the variable
or variables matched on —and, perhaps, variables substantially correlated with these variables —
are controlled.
A better procedure, in general, is analysis of covariance or other regression procedure, a subject we will examine
in later chapters.
Design 19.3 has many advantages and is frequently used. Its structure is similar to that of
Design 18.2, with two important differences: Design 18.2 lacks a control group and
randomization. Design 19.3 is similar to Designs 19.1 and 19.2, except that the "before"
or pretest feature has been added. It is used frequently to study change. Like Designs 19.1
and 19.2, it can be expanded to more than two groups.
In Design 19.3(a), subjects are assigned to the experimental group (top line) and the
control group (bottom line) at random and are pretested on a measure of Y, the dependent
variable. The investigator can then check the equality of the two groups on Y. The experimental
manipulation X is performed, after which the groups are again measured on Y. The
difference between the two groups is tested statistically. An interesting and difficult characteristic
of this design is the nature of the scores usually analyzed: difference, or change,
scores, Ya - Yb = D. Unless the effect of the experimental manipulation is strong, the
analysis of difference scores is not advisable. Difference scores are considerably less
reliable than the scores from which they are calculated.¹² There are other problems. We
discuss only the main strengths and weaknesses.¹³ At the end of the discussion the analytic
difficulties of difference or change scores will be taken up.
Probably most important, Design 19.3 overcomes the great weakness of Design 18.2,
because it supplies a comparison control group against which the difference, Ya - Yb, can
be checked. With only one group, we can never know whether history, maturation (or
both), or the experimental manipulation X produced the change in Y. When a control
group is added, the situation is radically altered. After all, if the groups are equated
(through randomization), the effects of history and maturation, if present, should be
present in both groups. If the mental ages of the children of the experimental group
increase, so should the mental ages of the children of the control group. Then, if there is
still a difference between the Y measures of the two groups, it should not be due to history
or maturation. That is, if something happens to affect the experimental subjects between
the pretest and the posttest, this something should also affect the subjects of the control
group. Similarly, the effect of testing, Campbell's reactive measures, should be controlled:
if the testing affects the members of the experimental group, it should similarly
affect the members of the control group. (There is, however, a concealed weakness here,
which will be discussed later.) This is the main strength of the well-planned and well-executed
before-after, experimental-control group design.
On the other hand, before-after designs have a troublesome aspect, which decreases
the external validity of the experiment, although the internal validity is not affected. This
source of difficulty is the pretest. A pretest can have a sensitizing effect on subjects. For
example, the subjects may possibly be alerted to certain events in their environment that
they might not ordinarily notice. If the pretest is an attitude scale, it can sensitize subjects
to the issues or problems mentioned in the scale. Then, when the X treatment is adminis-
tered to the experimental group, the subjects of this group may be responding not so much
to the attempted influence, the communication, or whatever method is used to change
attitudes, as to a combination of their increased sensitivity to the issues and the experi-
mental manipulation.
'"For a clear explanation of why this is so. see R. Thomdike and E. Hagen, Measuremeni uiid Evahuilion in
Psychology and Educalion. 4th ed. New York: Wiley. 1977. pp. 98-101. See, also, R. Thomdike. The Con-
cepts of Over- and Underachievemeni. New York: Bureau of Publications. Teachers College. Columbia Univer-
sity, 1963, pp. 39ff.; R. Thomdike, "Intellectual Status and Intellectual Growth," Journal of Educational
Psychology. 57 (1966), 121-127; and E. O'Connor. "Extending Classical Theory to the Measurement of
Change," Review of Educational Research. 42 1972). 73-97. O'Connor recommends a better statistical proce-
(
dure for handling change. We will outline the appropriate procedure later in the book in the context of multiple
regression.
"For a more complete discussion, see D. Campbell and J. Stanley. Experimental Designs and Quasi-Exper-
imental Designs for Research. Chicago. III.: Rand McNally. 1963, pp. 13ff.
Since such interaction effects are not immediately obvious, and since they contain a
threat to the external validity of experiments, it is worthwhile to consider them a bit
further. One would think that, since both the experimental and the control groups are
pretested, the effect of pretesting, if any, would ensure the validity of the experiment. Let
us assume that no pretesting was done, that is, that Design 19.2 was used. Other things
equal, a difference between the experimental and the control groups after experimental
manipulation of X can be assumed to be due to X. There is no reason to suppose that one
group is more sensitive or more alert than the other, since they both face the testing
situation after X. But when a pretest is used, the situation changes. While the pretest
sensitizes both groups, it can make the experimental subjects respond to X, wholly or
partially, because of the sensitivity. What we have, then, is a lack of generalizability: it
may be possible to generalize to pretested groups but not to unpretested ones. Clearly such
a situation is disturbing to the researcher, since who wants to generalize to pretested
groups?
If this weakness is important, why is this a good design? While the possible interaction
effect described above may be serious in some research, it is doubtful that it strongly
affects much behavioral research, provided researchers are aware of its potential and take
adequate precautions. Testing is an accepted and normal part of many situations, espe-
cially in education. It is doubtful, therefore, that research subjects will be unduly sensi-
tized in such situations. Still, there may be times when they can be affected. The rule
Campbell and Stanley give is a good one: When unusual testing procedures are to be used,
use designs with no pretests.
Difference Scores

Look at Design 19.3 again, particularly at changes between Yb and Ya. One of the most
difficult problems that has plagued and intrigued researchers, measurement specialists,
and statisticians is how to study and analyze such difference, or change, scores. In a
book of the scope of this one, it is impossible to go into the problems in detail. General
precepts and cautions, however, can be outlined. One would think that the application of
analysis of variance to difference scores yielded by Design 19.3 and similar designs would
be effective. Such analysis can be done if the experimental effects are substantial. But
difference scores, as mentioned earlier, are usually less reliable than the scores from
which they are calculated. Real differences between experimental and control groups may
be undetectable simply because of the unreliability of the difference scores. To detect
differences between experimental and control groups, the scores analyzed must be reliable
enough to reflect the differences and thus to be detectable by statistical tests. Because of
this difficulty experts even say that difference or change scores should not be used.¹⁴ So
what can be done?
The generally recommended procedure is to use so-called residualized or regressed
gain scores, which are scores calculated by predicting the posttest scores from the pretest
scores on the basis of the correlation between pretest and posttest, and then subtracting
these predicted scores from the posttest scores to obtain the residual gain scores. (The
reader should not be concerned if this procedure is not too clear at this stage. Later, after
we study regression and analysis of covariance, it should become clearer.) The effect of
the pretest scores is removed from the posttest scores; that is, the residual scores are
posttest scores purged of the pretest influence. Then the significance of the difference
'"See footnotes 12 and 13 and L. Cronbach and L. Furby. "How Should We Measure 'Cliange' — or Should
We?" Psychological Bulletin. 74 (1970), 68-80.
between the means of these scores is tested. All this can be accomplished either with the
regression procedure just described or with analysis of covariance.
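A minimal computational sketch may make the idea concrete (Python with NumPy; the data are simulated and the function name is invented, not a standard routine). Posttest scores are regressed on pretest scores, and the residuals, observed posttest minus predicted posttest, are the residualized gains; by construction they are uncorrelated with the pretest.

import numpy as np

def residual_gain(pre, post):
    # Regress posttest on pretest, then return observed minus predicted posttest.
    pre, post = np.asarray(pre, float), np.asarray(post, float)
    b = np.cov(pre, post, bias=True)[0, 1] / np.var(pre)   # regression slope
    a = post.mean() - b * pre.mean()                       # regression intercept
    return post - (a + b * pre)                            # residualized gain scores

rng = np.random.default_rng(1)
pre = rng.normal(50, 10, 200)
post = 0.7 * pre + rng.normal(0, 6, 200) + 15
gains = residual_gain(pre, post)
print(round(float(np.corrcoef(pre, gains)[0, 1]), 4))      # ~0.0: pretest influence removed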
Even the use of residual gain scores and analysis of covariance is not perfect, how-
ever. If subjects have not been assigned at random to the experimental and control groups,
the procedure will not save the situation. When groups differ systematically before experi-
mental treatment in other characteristics pertinent to the dependent variable, statistical
manipulation does not correct such differences.¹⁵ If, however, a pretest is used, use
random assignment and analysis of covariance, remembering that the results must always
be treated with special care. Finally, multiple regression analysis may provide the best
solution of the problem, as we will see later.¹⁶
Design 19.4

[R]        X    Ya   (Experimental)
      Yb             (Control)
The value of this design is doubtful, even though it is included among the adequate
designs. The scientific demand for a comparison is satisfied: there is a comparison group
(lower line). A major weakness of Design 18.3 (a pallid version of Design 19.4) is
remedied by the randomization. Recall that with Design 18.3 we were unable to assume
beforehand that the experimental and control groups were equivalent. Design 19.4 calls
for subjects to be assigned to the two groups at random. Thus, it can be assumed that they
are statistically equal. Such a design might be used when one is worried about the reactive
effect of pretesting, or when, due to the exigencies of practical situations, one has no other
choice. Such a situation occurs when one has the opportunity to try a method or some
innovation only once. To test the method's efficacy, one provides a base line for judging
the effect of X on Y by pretesting a group similar to the experimental group. Then Ya is
tested against Yb.
This design's validity breaks down if the two groups are not both randomly selected
from the same population or if the subjects are not assigned to the two groups at random.
Even then, it has the weaknesses mentioned in connection with other similar designs;
namely, other possible variables may be influential in the interval between Yb and Ya. In
other words, Design 19.4 is superior to Design 18.3, but it should not be used if a better
design is available.
Design 19.5

[R]   Yb    X    Ya   (Experimental)
      Yb   ~X    Ya   (Control 1)
            X    Ya   (Control 2)
¹⁵Ibid., p. 78.
¹⁶It is unfortunate that the complexities of design and statistical analysis may discourage the student, sometimes even to the point of feeling hopeless. But that is the nature of behavioral research: it merely reflects the exceedingly complex character of psychological, sociological, and educational reality. This is at one and the same time frustrating and exciting. Like marriage, behavioral research is difficult and often unsuccessful, but not impossible. Moreover, it is the only way to acquire reliable understanding of our behavioral world. The point of view of this book is that we should learn and understand as much as we can about what we are doing, use reasonable care with design and analysis, and then do the research without fussing too much about analytic matters. The main thing is always the research problem and our interest in it. This does not mean a cavalier disregard of analysis. It simply means reasonable understanding and care and healthy measures of both optimism and skepticism.
(third line). (It seems a bit strange to have a control group with an X, but the group of the
third line is really a control group.) With the Ya measures of this group available, it is
possible to check the interaction effect. Suppose the mean of the experimental group is
significantly greater than the mean of the first control group (second line). We may doubt
whether this difference was really due to X. It might have been produced by increased
sensitization of the subjects after the pretest and the interaction of their sensitization and
X. We now look at the mean of Ya of the second control group (third line). It, too, should
be significantly greater than the mean of the first control group. If it is, we can assume that
the pretest has not unduly sensitized the subjects, or that X is sufficiently strong to override
a sensitization-X interaction effect.
Design 19.6

[R]   Yb    X    Ya   (Experimental)
      Yb   ~X    Ya   (Control 1)
            X    Ya   (Control 2)
           ~X    Ya   (Control 3)
¹⁷Solomon, op. cit., pp. 137-150. Although this design can have a matching form, it is not discussed here nor is it recommended. The symbolism used above is not Solomon's.
¹⁸D. Campbell, "Factors Relevant to the Validity of Experiments in Social Settings," Psychological Bulletin, 54 (1957), 297-312. See, also, D. Campbell and J. Stanley, Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally, 1963.
that is, with Design 19.3, one can subtract Yb from Ya or do an analysis of covariance.
With the two lines, one can test the Ya's against each other with a t test or F test, but the
problem is how to obtain one overall statistical approach. One solution is to test the Ya's of
Controls 2 and 3 against the average of the two Yb's (the first two lines), as well as to test
the significance of the difference of the Ya's of the first two lines. In addition, Solomon
originally suggested a 2 × 2 factorial analysis of variance, using the four Ya sets of
measures.¹⁹ Solomon's suggestion is outlined in Figure 19.5. A careful study will reveal
that this is a fine example of research thinking, a nice blending of design and analysis.
With this analysis we can study the main effects, X and ~X, and Pretested and Not
Pretested. What is more interesting, we can test the interaction of pretesting and X and get
a clear answer to the previous problem.
                      X                    ~X
Pretested        Ya, Experimental      Ya, Control 1
Not Pretested    Ya, Control 2         Ya, Control 3

Figure 19.5
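Solomon's suggested analysis can be sketched in code. The function below (Python with NumPy; the scores are simulated, and the arithmetic is the ordinary balanced two-way analysis of variance with one degree of freedom per effect) takes the four sets of Ya measures arranged as in Figure 19.5 and returns F ratios for pretesting, treatment, and their interaction.

import numpy as np

def anova_2x2(cells):
    # cells[i][j]: Ya scores for row i (pretested, not pretested), column j (X, ~X).
    y = np.array(cells, dtype=float)             # shape (2, 2, n), balanced design
    n = y.shape[2]
    grand = y.mean()
    row_means = y.mean(axis=(1, 2))
    col_means = y.mean(axis=(0, 2))
    cell_means = y.mean(axis=2)
    ss_rows = 2 * n * ((row_means - grand) ** 2).sum()
    ss_cols = 2 * n * ((col_means - grand) ** 2).sum()
    ss_cells = n * ((cell_means - grand) ** 2).sum()
    ss_inter = ss_cells - ss_rows - ss_cols
    ms_within = ((y - cell_means[:, :, None]) ** 2).sum() / (4 * (n - 1))
    return ss_rows / ms_within, ss_cols / ms_within, ss_inter / ms_within

rng = np.random.default_rng(2)
cells = [[rng.normal(55, 5, 10), rng.normal(50, 5, 10)],   # pretested: X, ~X
         [rng.normal(54, 5, 10), rng.normal(50, 5, 10)]]   # not pretested: X, ~X
print(anova_2x2(cells))   # F for pretesting, F for X, F for the interaction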
While this and other complex designs have decided strengths, it is doubtful that they
can be used routinely. In fact, they should probably be saved for very important experi-
ments in which, perhaps, hypotheses already tested with simpler designs are again tested
with greater rigor and control. Indeed, it is recommended that designs like 19.5 and 19.6,
and certain variants of Design 19.6 to be discussed later, be reserved for definitive tests
of research hypotheses after a certain amount of preliminary experimentation has been
done.
[R]    X    Ya
      ~X    Ya
       X              Ya (later)
      ~X              Ya (later)
The Ya's of the third and fourth lines are observations of the dependent variable at any
specified later date. Such an alteration, of course, changes the purpose of the design and
may cause some of the virtues of Design 19.6 to be lost. We might, if we had the time, the
patience, and the resources, retain all the former benefits and still extend in time by
adding two more groups to Design 19.6 itself.
Compromise Designs
It is possible, indeed necessary, to use designs that are compromises with true experimen-
tation. Recall that true experimentation requires at least two groups, one receiving an
experimental treatment and one not receiving the treatment or receiving it in different
form. The true experiment requires the manipulation of at least one independent variable,
the random assignment of subjects to groups, and the random assignment of treatments to
groups. When one or more of these prerequisites is missing for one reason or another, we
have a compromise design. Although there are many possibilities, only one will be dis-
cussed at length below.
Design 19.7

Yb    X    Ya   (Experimental)
Yb   ~X    Ya   (Control)
The difference between Designs 19.3 and 19.7 is sharp. In Design 19.7, there is no
randomized assignment of subjects to groups, as in 19.3(a), nor is there matching of
subjects and then random assignment, as in 19.3(b). Design 19.7, therefore, is subject to
the weaknesses due to the possible lack of equivalence between the groups in variables
other than X. Researchers commonly take pains to establish equivalence by other means,
and to the extent that they succeed, the design is valid. This is
done in ways discussed below.
It is often difficult or impossible to equate groups by random selection or random
assignment, or by matching. Should one then give up doing the research? By no means.
Every effort should be made, first, to select and to assign at random. If both of these are
not possible, perhaps matching and random assignment can be accomplished. If they are
not, an effort should be made at least to use samples from the same population or to use
samples as alike as possible. The experimental treatments should be assigned at random.
Then the similarity of the groups should be checked using any information available:
sex, age, social class, and so on. The equivalence of the groups should be checked using
the means and standard deviations of the pretests; t tests and F tests will do. The distributions
should also be checked. Although one cannot have the assurance that randomization
gives, if these items all check one can go ahead with the study knowing at least that there
is no evidence against the equivalence assumption.
These precautions increase the possibilities of attaining internal validity. There are
still difficulties, all of which are subordinate to one main difficulty, called selection.
(These other difficulties will not be discussed here. For detailed discussion, see the Camp-
bell and Stanley chapter previously cited.)
Selection is one of the difficult and troublesome problems of behavioral research.
the volunteer subjects, also influences the dependent variable measures. This happens
even though the pretest may show the groups to be the same on the dependent variable.
The X manipulation is "effective," but it is not effective in and of itself. It is effective
because of selection, or self-selection.
Time Designs
Design 19.8

Y₁   Y₂   X   Y₃   Y₄
Note the similarity to Design 18.2, where a group is compared to itself. The use of
Design 19.8 allows us to avoid one of the difficulties of Design 18.2. Its use makes it
possible to separate reactive measurement effects from the effect of X. It also enables us
to see, if the measurements have a reactive effect, whether X has an effect over and above
that effect. The reactive effect should show itself at Y₂; this can be contrasted with Y₁. If
there is an increase at Y₃ over and above the increase at Y₂, it can be attributed to X. A
similar argument applies for maturation and history.
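The logic can be put in simple arithmetic. In this illustrative calculation (the means are invented), the Y₁-to-Y₂ gain estimates the combined testing, maturation, and history effect in the absence of X, and the Y₂-to-Y₃ gain spans X; the difference between the two gains estimates the effect of X itself.

y1, y2, y3, y4 = 50.1, 52.0, 57.9, 58.3   # hypothetical means at the four testings
baseline_gain = y2 - y1                   # change with no X: reactive and time effects
gain_across_x = y3 - y2                   # change spanning the X treatment
print(round(gain_across_x - baseline_gain, 1))   # 4.0, attributable to X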
One difficulty with longitudinal or time studies, especially with children, is the growth
or learning that occurs over time. Children do not stop growing and learning for research
convenience. The longer the time period, the greater the problem. In other words, time
itself is a variable in a sense. With a design like Design 18.2, Yb X Ya, the time variable
Concluding Remarks
The designs of this chapter are general: they are stripped down to bare essentials to
show underlying structure. Having the underlying structures well in mind — cognitive
psychologists say that such structures are important in remembering and thinking — the
student is in a position to use the more specific designs of analysis of variance and related
paradigms. Knowing and understanding the general designs may enhance mental flexibil-
ity and the ability to cope conceptually and practically with research problems and the
design means of solving the problems.
Study Suggestions
1. The first sentence of this chapter is "Design is data discipline." What does this sentence
mean? Justify it.
2. Suppose you are an educational psychologist and plan to test the hypothesis that feeding
back psychological information to teachers effectively enhances the children's learning by increas-
ing the teachers' understanding of the children. Outline an ideal research design to test this hypoth-
esis, assuming that you have complete command of the situation and plenty of money and help.
(These are important conditions, which are included to free the reader from the practical limitations
that so often compromise good research designs.) Set up two designs, each with complete randomi-
zation, both following the paradigm of Design 19.1. In one of these use only one independent
-"See J. Gottman, R. McFali. and J. Bamett, "Design and Analysis of Research Using Time Series,"
Psychological Bulletin. 11 (1969). 299-306; Campbell and Stanley, op. cil.. pp. 37-46.
variable and one-way analysis of variance. In the second, use two independent variables and a
simple factorial design.
How do these two designs compare in their control powers and in the information they yield?
Which one tests the hypothesis better? Why?
3. Design research to test the hypothesis of Study Suggestion 2, above, but this time compro-
mise the design by not having randomization. Compare the relative efficacies of the two ap-
proaches. In which of them would you put greater faith? Why? Explain in detail.
4. Suppose that a team of sociologists, psychologists, and educators believed that competent
and insightful counseling can change the predominantly negative attitudes of juvenile offenders for
the better. They took 30 juvenile offenders —
found to be so by the courts —
who had been referred
for counseling in the previous year and matched each of them to another nonoffender youngster on
sex and intelligence. They compared the attitudes of the two groups at the beginning and the end of
the year (the duration of the counseling), and found a significant difference at the beginning of the
year but no significant difference at the end. They concluded that counseling had a salutary effect on
the juvenile offenders' attitudes.
Criticize the research. Bring out its strengths and weaknesses. Keep the following in mind:
sampling, randomization, group comparability, matching, and control. Is the conclusion of the
researchers empirically valid, do you think? If not, outline a study that will yield valid conclusions.
5. The advice in the text not to use analysis of variance in nonexperimental research does not
apply so much to one-way analysis of variance as it does to factorial analysis. Nor does the problem
of equal numbers of cases in the cells apply (within reason). In a number of nonexperimental
studies, in fact, one-way analysis of variance has been profitably used. One such study is: S. Jones
and S. Cook, "The Influence of Attitude on Judgments of the Effectiveness of Alternative Social
Policies," Journal of Personality and Social Psychology, 32 (1975), 767-773. The independent
variable was attitude toward blacks, obviously not manipulated. The dependent variable was preference
for social policy affecting blacks: remedial action involving social change, or action involving
self-improvement of blacks. One-way analysis of variance was used with social policy preference
scores of four groups differing in attitudes toward blacks. (Attitudes toward blacks were also meas-
ured with an attitude scale.)
It is suggested that students read and digest this excellent and provocative study. It will be time
and effort well spent. You may also want to do an analysis of variance of the data of the authors'
Table 1, using the method outlined earlier of analysis of variance using n's, means, and standard
deviations (see Addendum, Chapter 13).
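For readers attempting the suggested reanalysis, the computation can be sketched as follows (Python; the summary statistics shown are invented placeholders, not the Jones and Cook data, and the standard deviations are assumed to be the usual n - 1 sample values).

def anova_from_summary(ns, means, sds):
    # One-way analysis of variance F ratio from group n's, means, and standard deviations.
    k, N = len(ns), sum(ns)
    grand = sum(n * m for n, m in zip(ns, means)) / N
    ss_between = sum(n * (m - grand) ** 2 for n, m in zip(ns, means))
    ss_within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))
    return (ss_between / (k - 1)) / (ss_within / (N - k))

# Four attitude groups with invented summary statistics:
print(anova_from_summary([20, 20, 20, 20], [3.1, 3.8, 4.4, 5.0], [1.1, 1.0, 1.2, 0.9]))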
Chapter 20

Research Design Applications: Randomized Groups
It is difficult to tell anyone how to do research. Perhaps the best thing to do is to make
sure that the beginner has a grasp of principles and possibilities. In addition, approaches
and tactics can be suggested. In tackling a research problem, the investigator should let his
mind roam, speculate about possibilities, even guess the pattern of results. Once the
possibilities are known, intuitions can be followed and explored. Intuition and imagina-
tion, however, are not much help if we know little or nothing of technical resources. On
the other hand, good research is not just methodology and technique. Intuitive thinking is
essential because it helps researchers arrive at solutions that are not merely conventional
and routine. It should never be forgotten, however, that analytic thinking and creative
intuitive thinking both depend on knowledge, understanding, and experience.
The main purposes of this chapter and the next are to enrich and illustrate our design
and statistical discussion with actual research examples and to suggest basic possibilities
for designing research so that the student can ultimately solve research problems. Our
summary purpose, then, is to supplement and enrich earlier, more abstract design and
statistical discussions.
Research Examples
The simplest form of Design 19.1 is a one-way analysis of variance paradigm in which k
groups are given k experimental treatments and the k means are compared with analysis of
variance or separate tests of significance. A glance at Figure 19.3, left side, shows this
simple form of 19.1 with k = 3. Strange to say, it is not used too often, researchers more
often preferring the factorial form of Design 19.1. Two one-way examples are given
below. One used random assignment; one probably did not. Unfortunately, some re-
searchers do not report how subjects were assigned to groups or treatments. The need to
report on method of subject selection and assignment to experimental groups should by
now be obvious.
write the names of birds and other animals. All subjects were then shown, on slides, the
names of 16 presidents and 16 states in random order. They were instructed to write as
many of the names as they could remember.
If mobilization was effective, the subjects of the presidents-mobilization group should
recall more presidents, and the subjects of the states-mobilization group should remember
more states. There should be no effect, of course, on the control group members. The
mean numbers of recalled presidents and recalled states in the three mobilization conditions
are given in Table 20.1. One-way analyses of variance of each of the sets of three
means yielded significant F ratios, and the patterns of the means were as predicted: the
presidents-mobilized group had the highest mean (7.77) in recall of presidents, and the
states-mobilized group had the highest mean (8.50) in recall of states. Post hoc tests
indicated that presidents-mobilized subjects recalled significantly more presidents than the
Table 20.1  Mean Recall of Presidents and States under the Three Mobilization Treatments (Presidents, States, Control), with F Ratios. Note: The italicized means show the crucial predictions: if presidents mobilized, then better recall of presidents, and if states mobilized, then better recall of states.
¹J. Peeck, "Effects of Mobilization of Prior Knowledge on Free Recall," Journal of Experimental Psychology: Learning, Memory, and Cognition, 8 (1982), 608-612.
other two groups, and states-mobilized subjects recalled more states than the other two
groups. Peeck also did a second experiment in which the recall test was delayed 24 hours.
The results were similar. Mobilization seems to be an effective means of enhancing recall.
Students should especially note the use of the same design (19.1) and analyses of variance
with two different categories of recall. One is considerably more convinced by such varied
replication than one would be if only one recall task had been used.
This ingenious and important study of stimulus and response generalization (roughly,
spread of effectiveness of stimuli to elicit the same or similar responses to other related
stimuli) deserves careful study not only because of a creative use of conditioning and a
clever use of analysis of variance, but also for its implications for teaching.² The study is
also noteworthy because randomization was evidently not used. It was probably deemed
unnecessary. Why? With any type of human response, especially physiological response,
that is universal to Homo sapiens and that does not exhibit a wide range of individual
differences, it is sometimes fairly safe not to randomize. If all people are pretty much alike
in a characteristic under study, it obviously makes no difference whether randomization is
used. In this respect, any single individual is representative of the whole human race. The
possession of blood, a heartbeat, and lungs are examples. Of course, the type of blood
and the rate of heartbeat and lung action, when used as variables, radically change the
picture. At any rate, in Wickens' study, it was probably assumed that all subjects are
conditionable. Still, it would have been better to assign subjects to groups at random,
because we know that there are individual differences in conditioning or conditionability.
It is conceivable that such differences might affect the experimental outcomes.
FACTORIAL DESIGNS
The basic general design is still Design 19.1, though the basic experimental group-control
pattern is drastically altered by the addition of other experimental factors. If sex is added as a
second factor, for example, males can be assigned at random to the cells A1B1 and A2B1,
and females to the cells A1B2 and A2B2.
We can often improve the design and increase the information obtained from a study by
adding groups. Instead of A1 and A2, and B1 and B2, an experiment may profit from A1,
A2, A3, and A4, and B1, B2, and B3. Practical and statistical problems increase and some-
times become quite difficult as variables are added. Suppose we have a 3 x 2 x 2 design
that has 3 x 2 x 2 = 12 cells, each of which has to have at least two subjects, and
preferably many more. (It is possible, but not very sensible, to have only one subject per
cell if one can have more than one. There are, of course, designs that have only one
subject per cell.) If we decide that 10 subjects per cell are necessary, 12 x 10 = 120
subjects will have to be obtained and assigned at random, as the sketch below illustrates.
The problem is more acute with one more variable, and the practical manipulation of the
research situation is also more difficult. But the successful handling of such an experiment
allows us to test a number of hypotheses and yields a great deal of information. The
combinations of three-, four-, and five-variable designs give a wide variety of possible
designs: 2 x 5 x 3, 4 x 4 x 2, 3 x 2 x 4 x 2, 4 x 3 x 2 x 2, and so on.
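To make the cell arithmetic and the random assignment concrete, here is a minimal sketch in
Python; the factor labels and subject identifiers are hypothetical, chosen only to mirror the
3 x 2 x 2 example above.

import itertools
import random

# Hypothetical 3 x 2 x 2 factorial: 12 cells, 10 subjects per cell.
a_levels = ["A1", "A2", "A3"]
b_levels = ["B1", "B2"]
c_levels = ["C1", "C2"]

cells = list(itertools.product(a_levels, b_levels, c_levels))
assert len(cells) == 12          # 3 x 2 x 2 = 12 cells

subjects = list(range(120))      # 12 x 10 = 120 subjects, numbered 0-119
random.shuffle(subjects)         # random assignment

# Deal the shuffled subjects into the cells, 10 to a cell.
assignment = {cell: subjects[i * 10:(i + 1) * 10]
              for i, cell in enumerate(cells)}

for cell, members in sorted(assignment.items()):
    print(cell, members)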
Examples of two- and three-dimensional factorial designs were described in Chapter 14.
(The restudy of these examples is recommended, because the reasoning behind the essen-
tial design can now be more easily grasped.) Since a number of examples of factorial
designs were given in Chapter 14, we confine the examples given here to three studies
with unusual features.
Flowers: Groupthink
Table 20.2 Mean Numbers of Solutions Proposed, (a), and Emergent Facts, (b), Flowers Study
[2 x 2 layout for each measure: High vs. Low Cohesiveness by Open vs. Closed Leadership;
e.g., in (a) the High Cohesiveness-Open Leadership mean was 6.70]
The student should particularly note that group measures were analyzed. Also note the
use of two dependent variables and two analyses of variance. That the main effect of open
and closed leadership was significant with number of proposed solutions and with facts
used is much more convincing than if only one of these had been used. An interesting and
potentially important experiment! Indeed, Flowers' operationalization of Janis' ideas of
groupthink and its consequences is a good example of experimental testing of complex
social ideas. It is also a good example of replication and of Design 19.1 in its simplest
factorial form.
We now outline an educational study done many years ago because it was planned to
answer an important theoretical and practical question and because it clearly illustrates a
complex factorial design. The research question was: What are the effects on the achieve-
ment and attitudes of pupils if teachers are given knowledge of the characteristics of their
pupils? Hoyt's study explored several aspects of the basic question and used factorial
design to enhance the internal and external validity of the investigation. The first design
was used three times for each of three school subjects, and the second and third were used
twice, once in each of two school systems.
The paradigm for the first design is shown in Figure 20.1. The independent variables
were treatments, ability, sex, and schools. The three treatments were no information (N),
test scores (T), and test scores plus other information (TO). These are self-explanatory.
Ability levels were high, medium, and low IQ. The variables sex and schools are obvious.
Eighth-grade students were assigned at random within sex and ability levels. It will help
us understand the design if we examine what a final analysis of variance table of the
design looks like. Before doing so, however, it should be noted that the achievement
results were mostly indeterminate (or negative). The F ratios, with one exception, were
not significant. Pupils' attitudes toward teachers, on the other hand, seemed to improve
with increased teacher knowledge of pupils, an interesting and potentially important find-
ing. The analysis of variance table is given in Table 20.4. One experiment yields 14 tests!
Naturally, a number of these tests are not important and can be ignored. The tests of
greatest importance (marked with asterisks in the table) are those involving the treatment
variable. The most important test is between treatments, the first of the main effects.
Perhaps equally important are the interactions involving treatments. Take the interaction
treatments x sex. If this were significant, it would mean that the amount of information a
teacher possesses about students has an influence on student achievement, but boys are
influenced differently than girls. Boys with teachers who possess information about their
pupils may do better than boys whose teachers do not have such information, whereas it
may be the opposite with girls, or it may make no difference one way or the other.
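The logic can be put in a few lines of arithmetic. In this sketch the cell means are fictitious,
invented only to show what a treatments x sex interaction looks like; they are not Hoyt's
results.

# Fictitious mean achievement scores in a 2 x 2 slice of the design:
# rows = teacher information vs. no information, columns = boys vs. girls.
means = {
    ("info", "boys"): 78.0, ("info", "girls"): 71.0,
    ("no_info", "boys"): 70.0, ("no_info", "girls"): 74.0,
}

boys_gain = means[("info", "boys")] - means[("no_info", "boys")]      # +8.0
girls_gain = means[("info", "girls")] - means[("no_info", "girls")]   # -3.0

# Interaction: the effect of information differs for boys and girls.
# A difference of differences near zero would mean no interaction.
print(boys_gain, girls_gain, boys_gain - girls_gain)   # 8.0 -3.0 11.0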
Second-order or triple interactions are harder to interpret. They seem to be rarely
significant. If they are significant, however, they require special study. Crossbreak tables
Table 20.4 Analysis of Variance Table, Hoyt Study

Source                                                   df
Main Effects:
  *Between Treatments                                     2
  Between Ability Levels                                  2
  Between Sexes                                           1
  Between Schools                                         1
First-Order Interactions:
  *Interaction: Treatments x Ability                      4
  *Interaction: Treatments x Sex                          2
  *Interaction: Treatments x School                       2
  Interaction: Ability x Sex                              2
  Interaction: Ability x School                           2
  Interaction: Sex x School                               1
Second-Order Interactions:
  *Interaction: Treatments x Ability x Sex                4
  *Interaction: Treatments x Ability x School             4
  Interaction: Ability x Sex x School                     2
Third-Order Interaction:
  Interaction: Treatments x Ability x Sex x School        4
Within
Total
of the means are perhaps the best way, but graphic methods, as discussed earlier, are often
enlightening. The student will find guidance in Edwards' book.
Extraneous attribute variables can be incorporated into factorial designs and thus controlled.
With factorial designs, too, it is possible to have mixtures of active and attribute variables,
another important need.
There are also weaknesses. One criticism has been that randomized subjects designs
do not permit tests of the equality of groups as do before-after designs. Actually, this is
not a valid criticism, for two reasons: with enough subjects and randomization, it can be
assumed that the groups are equal, as we have seen; and it is possible to check the groups
for equality on variables other than Y, the dependent variable. For educational research,
data on intelligence, aptitude, and achievement, for example, are available in school
records. Pertinent data for sociology and political science studies can often be found in
county and election district records.
Another difficulty is statistical. One should have equal numbers of cases in the cells of
factorial designs. (It is possible to work with unequal n's, but it is both clumsy and a threat
to interpretation. Small discrepancies can be cured by dropping out cases at random.) This
imposes a limitation on the use of such designs, because it is often not possible to have
equal numbers in each cell. One-way randomized designs are not so delicate: unequal
numbers are not a difficult problem.
Compared to matched groups designs, randomized subjects designs are usually less
precise; that is, the error term is ordinarily larger, other things equal. It is doubtful,
however, whether this is cause for concern. In some cases it certainly is — for example,
where a very sensitive test of a hypothesis is needed. In much behavioral research,
though, it is probably desirable to consider as nonsignificant any effect that is insuffi-
ciently powerful to make itself felt over and above the random noise of a randomized
subjects design.
All in all, then, these are powerful, flexible, useful, and widely applicable designs. In
the opinion of the writer, they are the best all-round designs, perhaps the first to be
considered when planning the design of a research study.
Study Suggestions
1. (a) Draw three groups of random numbers, 0 through 9. Name the independent and depend-
ent variables. Express a hypothesis and translate it into design-statistical language. Do a
one-way analysis of variance. Interpret.
(b) Repeat 1(a) with five groups of numbers.
(c) Now increase the numbers of one of your groups by 2, and decrease those of another
group by 2. Repeat the statistical analysis.
(d) Draw four groups of random numbers, 10 in each group. Set them up, at random, in a
2 x 2 factorial design. Do a factorial analysis of variance.
How to adjust and analyze data for unequal n's is a complex, thorny, and much-argued problem. For a
discussion in the context mostly of analysis of variance, see G. Snedecor and W. Cochran, Statistical Methods,
6th ed. Ames, Iowa: Iowa State University Press, 1967, chap. 16. Discussion in the context of multiple regres-
sion, which is actually a better solution of the problem, can be found in: F. Kerlinger and E. Pedhazur, Multiple
Regression in Behavioral Research. New York: Holt, Rinehart and Winston, 1973, pp. 140-151, 187-197; and
E. Pedhazur, Multiple Regression in Behavioral Research: Explanation and Prediction, 2d ed. New York: Holt,
Rinehart and Winston, 1982, pp. 316-323, 371-373. Pedhazur's discussions are detailed and authoritative. He
reviews the issues and suggests solutions.
(e) Bias the numbers of the two right-hand cells by adding 3 to each number. Repeat the
analysis. Compare with the results of 1(d).
(f) Bias the numbers of the data of 1(d), as follows: add 2 to each of the numbers in the
upper left and lower right cells. Repeat the analysis. Interpret. (A small simulation of
parts (a) and (c) is sketched below.)
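Here is a minimal simulation of Study Suggestions 1(a) and 1(c), assuming Python with the
scipy library; the seed and group sizes are arbitrary choices, not prescriptions.

import random
from scipy.stats import f_oneway

random.seed(20)  # arbitrary; different seeds give different draws

# 1(a): three groups of ten random digits, 0 through 9.
groups = [[random.randint(0, 9) for _ in range(10)] for _ in range(3)]
f, p = f_oneway(*groups)
print("unbiased: F =", round(f, 2), " p =", round(p, 3))   # rarely significant

# 1(c): raise one group's numbers by 2 and lower another's by 2.
biased = [[x + 2 for x in groups[0]],
          [x - 2 for x in groups[1]],
          list(groups[2])]
f, p = f_oneway(*biased)
print("biased:   F =", round(f, 2), " p =", round(p, 3))   # F should grow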
2. Look up Study Suggestions 2 and 3, Chapter 14. Work through both examples again. (Are
they easier for you now?)
3. Suppose that you are the principal of an elementary school. Some of the fourth- and fifth-
grade teachers want to dispense with workbooks. The superintendent does not like the idea, but he
is willing for you to test the notion that workbooks do not make much difference. (One of the
teachers even suggests that workbooks may have bad effects on both teachers and pupils.) Set up
two research plans and designs to test the efficacy of the workbooks: a one-way design and a
factorial design. Consider the variables achievement, intelligence, and sex. You might also consider
the possibility of teacher attitude toward workbooks as an independent variable.
4. Study Table 20.1, the data of the Peeck study. Peeck used one-way analyses of variance, one
analysis for the recall measures of presidents, the other for the recall measures of states. He was of
course testing the significance of the mean recall differences among the mobilization conditions.
Could he have used a 3 x 2 factorial design? If so, what would the advantages be, if any? In Table
20.1, why were the two means italicized?
5. Suppose an investigation using methods and sex as the independent variables and achieve-
ment as the dependent variable has been done, with the results shown in Table 20.5. The numbers in
the cells are fictitious means. The F ratios of methods and sex are not significant. The interaction F
ratio is significant at the .01 level. Interpret these results statistically and substantively. To do the
latter, name the three methods.
Table 20.5
Chapter 21

Research Design
Applications:
Correlated Groups
paradigm is given in Figure 21.1. To emphasize the sources of variance, means of col-
umns and rows have been indicated. The individual dependent variable measures (Y's)
are entered in the cells.
The word "group" should be taken to mean set of scores. Then there is no confusion when a repeated
trials experiment is classified as a multigroup design.
Figure 21.1 [Units (rows) by treatments (columns), with row and column means indicated]
This design has two forms, the better of which (repeated here) was described in Chapter
19 as Design 19.2:
Mr   X    Y   (Experimental)
     ~X   Y   (Control)
In this design, subjects are first matched and then assigned to experimental and control
groups at random. In the other form, subjects are matched but not assigned to experimen-
tal and control groups at random. The latter design can be indicated by simply dropping
the subscript r from Mr (described in Chapter 18 as Design 18.4, one of the inadequate
designs).
The design-statistical paradigm of this warhorse of designs is shown in Figure 21.3.
The insertion of the symbols for the means shows the two sources of systematic variance:
treatments and pairs, columns and rows. This is in clear contrast to the randomized
designs of Chapter 20, where the only systematic variance was treatments, or columns.
The most common variant of the two-group, experimental group-control group design
is the before-after, two-group design. [See Design 19.3(b).]

Thorndike: Mental Discipline in High School Studies
Thorndike used an ingenious device to separate the differential effect of each school
subject by matching on Form A of the intelligence test those pupils who studied, for
instance, English, history, geometry, and Latin with those pupils who studied English,
history, geometry, and shopwork. Thus, for these two groups, he was comparing the
differential effects of Latin and shopwork. Gains in final intelligence scores were consid-
ered a joint effect of growth plus the academic subjects studied.
Despite its weaknesses, this was a colossal study. Thorndike was aware of the lack of
adequate controls, as revealed in the following passage on the effects of selection:
The chief reason why good thinkers seem superficially to have been made such by having taken
certain school studies, is that good thinkers have taken such studies. . . . When the good
thinkers studied Greek and Latin, these studies seemed to make good thinkers. Now that the
good thinkers study Physics and Trigonometry, these seem to make good thinkers. If the abler
pupils should all study Physical Education and Dramatic Art, these subjects would seem to
make good thinkers.
Thorndike pointed the way to controlled educational research, which has led to the de-
crease of metaphysical and dogmatic explanations in education. His work struck a blow
against the razor-strop theory of mental training, the theory that likened the mind to a
razor that could be sharpened by stropping it on "hard" subjects.
It is not easy to evaluate a study such as this, the scope and ingenuity of which is
impressive. One wonders, however, about the adequacy of the dependent variable, "in-
telligence" or "intellectual ability." Can school subjects studied for one year have much
effect on intelligence? Moreover, the study was not experimental. Thorndike measured
the intelligence of students and let the independent variables, school subjects, operate. No
randomization, of course, was possible. As mentioned above, he was aware of this control
weakness in his study, which is still a classic that deserves respect and careful study
despite its weaknesses in history and selection (maturation was controlled).
In Study Suggestion 4, Chapter 15, we presented data from one of the set of remarkable
studies of the learning of autonomic functioning done by Miller and his colleagues. It has
been believed by experts and laymen that it is not possible to learn to control responses
of the autonomic nervous system. That is, glandular and visceral responses — heart beat,
urine secretion, and blood pressure, for example — were supposed to be beyond the "con-
trol" of the individual. Miller believed otherwise. He demonstrated experimentally that
E. Thorndike, "Mental Discipline in High School Studies," Journal of Educational Psychology, 15
(1924), 1-22, 83-98.
Ibid., p. 98.
N. Miller, "Learning of Visceral and Glandular Responses," Science, 163 (1969), 434-445. Miller
reports a number of these studies in: N. Miller, Selected Papers. New York: Aldine, 1971, Part XI. The study
now to be reported is: N. Miller and L. DiCara, "Instrumental Learning of Urine Formation by Rats: Changes in
Renal Blood Flow," American Journal of Physiology, 215 (1968), 677-683.
such responses are subject to instrumental learning. The crucial part of his method con-
sisted of rewarding visceral responses when they occurred. In the study whose data were
cited in Chapter 15, for example, rats were rewarded when they increased or decreased the
secretion of urine. Fourteen rats were assigned at random to two groups called "Increase
Rats" and "Decrease Rats." The rats of the former group were rewarded with brain
stimulation (which was shown to be effective for increases in urine secretion), while the
rats of the latter group were rewarded for decreases in urine secretion during a "training"
period of 220 trials in approximately three hours.
To show part of the experimental and analytic paradigms of this experiment, the data
before and after the training periods for the Increase Rats and the Decrease Rats are given
in Table 21.1 (taken from the authors' Table 1). The measures in the table are the millili-
ters of urine secretion per minute per 100 grams of body weight. Note that they are very
small quantities. The research design is a variant of Design 19.3(a):

X    Ya   (E)
~X   Ya   (C)
The difference is that ~X, which in the design means absence of experimental treatment
for the control group, now means reward for decrease of urine secretion. The usual
analysis of the after-training measures of the two groups is therefore altered.
We can better understand the analysis if we analyze the data of Table 21.1 somewhat
differently than Miller and DiCara did. (They used t tests.) I did a two-way (repeated
measures) analysis of variance of the Increase Rats data, Before and After, and the
Decrease Rats data, Before and After. The Increase Before and After means were .017
and .028, and the Decrease means were .020 and .006. The Increase F ratio was 43.875
{df= 1,6); the Decrease F was 46.624. Both were highly significant. The two Before
means of .017 and .020 were not significantly different, however. In this case, compari-
son of the means of the two After groups, the usual comparison with this design, is
probably not appropriate because one was for increase and the other for decrease in urine
secretion.
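For two repeated measurements, the repeated-measures F is simply the square of the paired t,
so the analysis just described can be sketched briefly. The values below are fabricated stand-ins
(the entries of Table 21.1 are not reproduced here), and the scipy library is assumed.

from scipy.stats import ttest_rel

# Fabricated before/after secretion rates for seven "Increase" rats.
before = [0.015, 0.018, 0.016, 0.019, 0.017, 0.018, 0.016]
after  = [0.025, 0.030, 0.027, 0.031, 0.026, 0.029, 0.028]

t, p = ttest_rel(after, before)     # paired t test on correlated measures
print("t =", round(t, 3), " p =", round(p, 5))
print("repeated-measures F = t^2 =", round(t ** 2, 3), " (df = 1, 6)")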
Table 21.1 Secretion of Urine Data, Miller and DiCara Study: Increase Rats and Decrease
Rats, Before and After Training

This whole study, with its highly controlled experimental manipulations and its "con-
trol" analyses, is a lovely example of imaginative conceptualization and disciplined com-
petent analysis. The above analysis is one example. But the authors did much more. For
example, to be more sure that the reinforcement affected only urine secretion, they com-
pared the Before and After heart rates (beats per minute) of both the Increase and the
Decrease rats. The means were 367 and 412 for Increase rats, and 373 and 390 for the
Decrease rats. Neither difference was statistically significant. Similar comparisons of
blood pressure and other bodily functions were not significant.
Students will do well to study this fine example of laboratory research until they
clearly understand what was done and why. It will help students learn more about con-
trolled experiments, research design, and statistical analysis than most textbook exercises.
It is a splendid achievement!
Bandura and Menlove: Modeling and Avoidance Behavior

Bandura and Menlove studied the vicarious extinction, through symbolic modeling, of young
children's avoidance of dogs. Children in the single-model condition observed a model
interacting positively with one dog; children in the multiple-model condition observed boys
and girls interacting positively with different dogs. Children in the control condition were
shown other unrelated movies. The avoidance behavior was tested after the treatment
(posttest), and again in a later follow-up.
Analyses used in the study were unusual: because of nonnormality of the score distri-
butions, nonparametric tests were used. First, to test the trend effect — that is, the differ-
ences among the scores of the pretest, posttest, and follow-up — a Friedman nonparamet-
ric analysis was used. χr² was 15.79, significant at the .001 level. Recall that the
Friedman test takes account of the correlation. Here, the differences among the experi-
mental treatments of the same children were tested. Thus, there was significant change.
Other tests showed that the two modeling conditions changed, but the control condition
did not change. The significance of the differences among the three experimental treat-
ments was tested using the Kruskal-Wallis one-way nonparametric analysis of variance.
Change or difference scores obtained between pretest and follow-up were analyzed. H
was 5.01, significant at the .05 level. Other analyses showed that the modeling treatments
were effective.
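Both nonparametric tests are easy to reproduce. A minimal sketch with invented scores (not
Bandura and Menlove's data), assuming Python with the scipy library:

from scipy.stats import friedmanchisquare, kruskal

# Correlated groups: pretest, posttest, follow-up scores of the SAME children.
pre       = [3, 5, 2, 4, 6, 3, 5, 4]
post      = [6, 8, 4, 7, 9, 5, 8, 6]
follow_up = [7, 9, 5, 8, 9, 6, 8, 7]
chi_r2, p = friedmanchisquare(pre, post, follow_up)
print("Friedman chi_r^2 =", round(chi_r2, 2), " p =", round(p, 4))

# Independent groups: pretest-to-follow-up change scores, one list per treatment.
multiple_model = [4, 5, 3, 4, 5]
single_model   = [3, 4, 2, 3, 4]
control        = [0, 1, 0, -1, 1]
h, p = kruskal(multiple_model, single_model, control)
print("Kruskal-Wallis H =", round(h, 2), " p =", round(p, 4))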
The authors then added an ingenious twist to the study: they used the multiple-model-
ing procedure with the control group Ss at the end of the experiment proper. These Ss
already had three approach scores: pretest, posttest, follow-up. Call the new fourth set of
scores posttherapy. The significance of the differences among the four sets of scores —
note carefully that each S had four scores and thus we have correlated groups — was tested
with the Friedman test. χr² was 13.42, significant at the .01 level. Thus the modeling
procedure was also effective with the controls.
To be quite clear on what was done, the three principal analyses used by Bandura and
Menlove are summarized in Table 21.2, using the symbolism already familiar to us or
self-evident. Recall that the Kruskal-Wallis test is a one-way analysis and uses the ranks of
all the scores.

A. Bandura and F. Menlove, "Factors Determining Vicarious Extinction of Avoidance Behavior through
Symbolic Modeling," Journal of Personality and Social Psychology, 8 (1968), 99-108.
Table 21.2 Paradigmatic Outline of Analyses of Bandura and Menlove Modeling Study
Figure 21.4 [Units 1-5 (devices, measures, etc.) crossed with methods or treatments A1,
A2, A3, at two levels, B1 and B2; Y means entered for rows and columns]
E. Lindquist, Statistical Analysis in Educational Research. Boston: Houghton Mifflin, 1940.
Designs like that of Figure 21.4 make it possible to isolate and measure variances and to
test interactions. Note that the two main sources of variance,
treatments (A) and levels (B), and the units variance can be evaluated; that is, the differ-
ences between the A, B, and units means can be tested for significance. In addition, three
interactions can be tested: treatments by levels, treatments by units, and levels by units. If
individual scores are used in the cells instead of means, the triple interaction, too, can be
tested. Note how important such interaction can be, both theoretically and practically. For
example, questions like the following can be answered: Do treatments work differently in
different units? Do certain methods work differently at different intelligence levels or with
different sexes or with children of different socioeconomic levels?
Suedfeld and Rank, in a study mentioned earlier in another context, tested the intriguing
notion that successful revolutionary leaders — Lenin, Cromwell, Jefferson, for example —
are conceptually simple in their public communications before revolution and conceptu-
ally complex after revolution. Unsuccessful revolutionary leaders, on the other hand, do
not differ in conceptual complexity before and after revolution. The problem lends itself
to a factorial design and to repeated measures analysis. The design and the data on
conceptual complexity are shown in Table 21.3. It can be seen that the successful leaders
became conceptually more complex — from 1.67 to 3.65 — but unsuccessful leaders did
not change — 2.37 and 2.22. The interaction F ratio was 12.37, significant at the .005
level. The hypothesis was supported.
A few points should be picked up. One, note the effective combining of factorial
design and repeated measures. When appropriate, as in this case, the combination is
a powerful one.

Table 21.3 Factorial Design with Repeated Measures: Suedfeld and Rank
Study of Revolutionary Leaders

              Pretakeover    Posttakeover
              1.96           3.05
The advanced student will want to know how to handle units (schools, classes, etc.) and units variance in
factorial designs. Detailed guidance is given in A. Edwards, Experimental Design in Psychological Research,
4th ed. New York: Holt, Rinehart and Winston, 1972, chap. 14, and in R. Kirk, Experimental Design: Proce-
dures for the Behavioral Sciences. Belmont, Calif.: Brooks/Cole, 1968, pp. 229-244 and chap. 8. The subject is
difficult. Even the names of the designs become complex: randomized blocks, nested treatments, split-plot
designs. Such designs are powerful, however: they combine virtues of factorial designs and of correlated groups
designs. When needed, Edwards and Kirk are good guides. It is suggested, in addition, that help be solicited
from someone who understands both statistics and behavioral research. It is unwise to use computer programs
merely because their names seem appropriate. It is also unwise to seek analytic help from computer personnel.
One cannot expect such people to know and understand, say, factorial analysis of variance. That is not their job.
More will be said about computer analysis in later chapters.
P. Suedfeld and A. Rank, "Revolutionary Leaders: Long-Term Success as a Function of Changes in
Conceptual Complexity," Journal of Personality and Social Psychology, 34 (1976), 169-178.
Two, the study was nonexperimental: no experimental variable was manipulated. Three,
and most important, note the intrinsic interest and significance of the research problem
and its theory.
ANALYSIS OF COVARIANCE
The invention of the analysis of covariance by Ronald Fisher was an important event in
behavioral research methodology. Here is a creative use of the variance principles com-
mon to experimental design and to correlation and regression theory — which we study
later in the book — to help solve a long-standing control problem.
Analysis of covariance is a form of analysis of variance that tests the significance of
the differences among means of experimental groups after taking into account initial
differences among the groups and the correlation of the initial measures and the dependent
variable measures. That is, analysis of covariance analyzes the differences between exper-
imental groups on Y, the dependent variable, after taking into account either initial differ-
ences between the groups on Y (pretest), or differences between the groups in some
pertinent independent variable or variables, X, substantially correlated with Y, the depend-
ent variable. The measure used as a control variable — the pretest or pertinent variable — is
called a covariate.
There is little point in describing the statistical procedures and calculations of analysis of
covariance. First, in their conventional form, they are complex and hard to follow. Sec-
ond, we wish here only to convey the meaning and purpose of the approach. Third and
most important, there is a much easier way to do what analysis of covariance does. Later
in the book we will see that analysis of covariance is a special case of multiple regression
and is much easier to do with multiple regression (a small sketch of the regression route is
given below). To give the reader a feeling for what analysis of covariance accomplishes,
let us look at an effective use of the procedure in an educational study.
Clark and Walberg thought that their subjects, potential school dropouts doing poorly
in school, needed far more reinforcement (encouragement, reward, etc.) than subjects
doing well in school. So they used massive reinforcement with their experimental group
subjects and moderate reinforcement with their control group subjects. Since their de-
pendent variable, reading achievement, is substantially correlated with intelligence, they
also needed to control intelligence. A one-way analysis of variance of the reading
achievement means of the experimental and control groups yielded an F of 9.52, signifi-
cant at the .01 level, supporting their belief. It is conceivable, however, that the difference
between the experimental and control groups was due to intelligence rather than to rein-
The above sentence, for instance, may be incongruent with the use of variables in this study. Suedfeld and
Rank analyzed measures of the independent variable, conceptual complexity. But the hypothesis under study
was actually: If conceptual complexity (after revolution), then successful leadership. But with a research prob-
lem of such compelling interest and a variable of such importance (conceptual complexity) imaginatively and
competently measured, who wants to quibble? (The logical difficulty is similar to that of the salt-tasting study of
McGee and Snyder cited in Chapter 12, in which salting food was the dependent variable, but the researchers
analyzed measures of the independent variable, attribution of dispositional and environmental factors.)
"C. Clark and H. Walberg, "The Influence of Massive Rewards on Reading Achievement in Potential
School Dropouts," American Educational Research Journal. 5 (1968), 305-310.
Table 21.4 Design and Analysis, Clark and Walberg Study

Experimental                            Control
(Massive Reinforcement)                 (Moderate Reinforcement)
X              Y                        X              Y
(Intelligence) (Reading)                (Intelligence) (Reading)
forcement. That is, even though the Ss were assigned at random to the experimental
groups, an initial difference in intelligence in favor of the experimental group may have
been enough to make the experimental group reading mean significantly greater than the
control group reading mean, since intelligence is substantially correlated with reading.
With random assignment, it is unlikely to happen, but it can happen. To control this
possibility, Clark and Walberg used analysis of covariance.
Study Table 21.4, which shows in outline the design and analysis. The means of the X
and Y scores, as reported by Clark and Walberg, are given at the bottom of the table. The
Y means are the main concern. They were significantly different. Although it is doubtful
that the analysis of covariance will change this result, it is possible that the difference
between the X means, 92.05 and 90.73, may have tipped the statistical scales, in the test
of the difference between the Y means, in favor of the experimental group. The analysis of
covariance F test, which uses Y sums of squares and mean squares purged of the influence
of X, was significant at the .01 level: F = 7.90. Thus the mean reading scores of the
experimental and control groups differed significantly, after being adjusted or controlled
for intelligence.
It is hoped that narrowly circumscribed notions of doing research with, say, only one experi-
mental group and one control group, or with matched subjects, or with one group, before
and after, may be widened. The second objective was to convey a sense of the balanced
structure of good research designs, to develop a sensitive feeling for the architecture of
design. Design must be formally as well as functionally fitted to the research problems we
seek to solve. The third objective was to help the reader understand the logic of experi-
mental inquiry and the logic of the various designs. Research designs are alternative
routes to the same destination: reliable and valid statements of the relations among
variables. Some designs, if feasible, yield stronger relational statements than other de-
signs.
In a certain sense, the fourth objective of Part Six has been the most difficult to
achieve: to help the student understand the relation between the research design and
statistics. Statistics is, in one sense, the technical discipline of handling variance. And, as
we have seen, one of the basic purposes of design is to provide control of systematic and
error variances. This is the reason for treating statistics in such detail in Parts Four and
Five before considering design in Part Six. Fisher expresses this idea succinctly when he
says, "Statistical procedure and experimental design are only two different aspects of the
same whole, and that whole comprises all the logical requirements of the complete proc-
ess of adding to natural knowledge by experimentation."
A well-conceived design is no guarantee of the validity of research findings. Elegant
designs nicely tailored to research problems can still result in wrong or distorted conclu-
sions. Nevertheless, the chances of arriving at accurate and valid conclusions are better
with sound designs than with unsound ones. This is relatively sure: if design is faulty, one
can come to no clear conclusions. If, for instance, one uses a two-group, matched-sub-
jects design when the research problem logically demands a factorial design, or if one uses
a factorial design when the nature of the research situation calls for a correlated-groups
design, no amount of interpretative or statistical manipulation can increase confidence in
the conclusions of such research.
It is fitting that Fisher should have the last word on this subject. In the first chapter of
his book. The Design of Experiments, he said:
If the design of an experiment is faulty, any method of interpretation which makes it out to be
decisive must be faulty too. It is true that there are a great many experimental procedures which
are well designed in that they may lead to decisive conclusions, but on other occasions may fail
to do so; in such cases, if decisive conclusions are in fact drawn when they are unjustified, we
may say that the fault is wholly in the interpretation, not in the design. But the fault of inter-
pretation . . . lies in overlooking the characteristic features of the design which lead to the
result being sometimes inconclusive, or conclusive on some questions but not on all. To under-
stand correctly the one aspect of the problem is to understand the other.
Study Suggestions
1. Can memory be improved by training? William James, the great American psychologist and
philosopher, did a memory experiment on himself. He first learned 158 lines of a Victor Hugo
poem, which took him 131-5/6 minutes. This was his baseline. Then he worked for 20-odd minutes
daily, for 38 days, learning the entire first book of Paradise Lost. (Book I is 22 tightly printed pages
of rather difficult verse!) This was training of his memory. He returned to the Hugo poem and
learned 158 additional lines in 151-1/2 minutes. Thus he took longer after the training than before.
Not satisfied, he had others do similar tasks — with similar results.
On the basis of this work, what conclusions could James come to? Comment on his research
design. What design among those in this book does his design approximate?
2. In the Miller and DiCara study outlined in this chapter, the authors did parallel analyses. In
addition to their analyses of urine secretion, for example, they analyzed heart beat rate and blood
pressure. Why did they do this?
In her classic study of "natural categories," Rosch replicated the original study of colors with
forms (square, circle, etc.). What advantage is there in such replication?
3. I did a two-way (repeated measures) analysis of variance of the Miller and DiCara Increase
Rats data of Table 21.1, with some of the results reported in the table. ω² (Hays omega-squared)
was .357. ω² for the Decrease Rats data was .663. What do these coefficients mean? Why calculate
them? (A computational sketch of ω² follows.)
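For reference, Hays' ω² can be computed directly from the summary table. The function
below uses the standard one-way formula (the repeated-measures designs discussed in the
text need a modified denominator), and the sample values are invented for illustration.

def omega_squared(ss_between, ss_total, ms_within, k):
    """Hays' omega-squared for a one-way ANOVA with k groups: the
    estimated proportion of total variance due to the treatments."""
    return (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

# Invented summary-table values for illustration:
print(round(omega_squared(ss_between=120.0, ss_total=520.0,
                          ms_within=8.0, k=3), 3))   # 0.197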
4. Kolb, basing his work on the outstanding work of McClelland on achievement motivation,
did a fascinating experiment with underachieving high school boys of high intelligence. Of 57
R. Fisher, The Design of Experiments, 6th ed. New York: Hafner, 1951, p. 3.
Ibid., pp. 2-3.
W. James, The Principles of Psychology. New York: Holt, 1890, pp. 666-667.
E. Rosch, "Natural Categories," Cognitive Psychology, 4 (1973), 328-350.
D. Kolb, "Achievement Motivation Training for Underachieving High-School Boys," Journal of Person-
ality and Social Psychology, 2 (1965), 783-792.
boys, he assigned 20 at random to a training program in which, through various means, the boys
were "taught" achievement motivation (an attempt to build a need to achieve into the boys). The
boys were given a pretest of achievement motivation in the summer, and given the test again six
months later. The mean change scores were, for experimental and control groups, respectively,
6.72 and -.34, significant at the .005 level.
(a) Comment on the use of change scores. Does their use lessen our faith in the statistical
significance of the results?
(b) Might factors other than the experimental training have induced the change?
5. Lest the student believe that only continuous measures are analyzed and that analysis of
variance alone is used in psychological and educational experiments, read the study by Freedman et
al. on guilt and compliance. There was an experimental group (Ss induced to lie) and a control
group, and the dependent variable was measured by whether a S did or did not comply with a request
for help. The results were reported in crossbreak frequency tables.
Read the study, and, after studying the authors" design and results, design one of the three
experiments another way. Bring in another independent variable, for instance. Suppose that it was
known that there were wide individual differences in compliance. How can this be controlled?
Name and describe two kinds of design to do it.
6. One useful means of control by matching is to use pairs of identical twins. Why is this
method a useful means of control? If you were setting up an experiment to test the effect of
environment on measured intelligence and you had 20 pairs of identical twins and complete experi-
mental freedom, how would you set up the experiment?
7. In a study in which training on the complexities of art stimuli affected attitude toward music,
among other things, Renner used analysis of covariance, with the covariate being measures from a
scale to measure attitude toward music. This was a pretest. There were three experimental groups.
Sketch the design from this brief description. Why did Renner use the music attitude scale as a
pretest? Why did she use analysis of covariance? (Note: The original report is well worth reading.
The study, in part a study of creativity, is itself creative.)
8. In a significant study of the effect of liberal arts education on complex concept formation,
Winter and McClelland found the difference between seniors and freshmen of a liberal arts college on
a measure of complex concept formation to be statistically significant (Ms = 2.00, Mf = 1.22;
t = 3.76, p < .001). Realizing that a comparison was needed, they also tested similar mean
differences in a teachers college and in a community college. Neither of these differences was
statistically significant.
Why did Winter and McClelland test the relation in the teachers college and in the community
college?
It is suggested that students look up the original report — it is well worth study — and do analysis
of variance from the reported n's, means, and standard deviations, using the method outlined in
Chapter 13 (Addendum). (A sketch of that computation is given below.)
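That computation needs no raw data. A minimal Python sketch follows; the two means are
those quoted above, but the n's and standard deviations are invented placeholders, since the
report's actual values should be used.

from scipy.stats import f as f_dist

def anova_from_summary(ns, means, sds):
    """One-way ANOVA F computed from group sizes, means, and SDs."""
    k, n_total = len(ns), sum(ns)
    grand = sum(n * m for n, m in zip(ns, means)) / n_total
    ss_between = sum(n * (m - grand) ** 2 for n, m in zip(ns, means))
    ss_within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))
    df1, df2 = k - 1, n_total - k
    F = (ss_between / df1) / (ss_within / df2)
    return F, float(f_dist.sf(F, df1, df2))

F, p = anova_from_summary(ns=[40, 45], means=[2.00, 1.22], sds=[1.0, 1.1])
print("F =", round(F, 2), " p =", round(p, 4))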
9. One virtue of analysis of covariance seldom mentioned in texts is that three estimates of the
correlation between X and Y can be calculated: the total r over all the scores; the between-groups r,
which is the r between the X and Y means; and the within-groups r, the r calculated from an average
of the r's between X and Y within the k groups. The within-groups r is the "best" estimate of the
"true" r between X and Y. Why is this so?
[Hint: Can a total r, the one usually calculated in practice, be inflated or deflated by between-
groups variance?]
J. Freedman, S. Wallington, and E. Bless, "Compliance Without Pressure: The Effect of Guilt," Journal
of Personality and Social Psychology, 7 (1967), 117-124.

10. The 2 x 2 x 2 factorial design is used a good deal by social psychologists. Here are three
unusual, excellent, even creative studies in which it was used:

Aronson, E., and Gerard, E. "Beyond Parkinson's Law: The Effect of Excess Time on
Subsequent Performance," Journal of Personality and Social Psychology, 3 (1966), 336-
339.
Carlsmith, J., and Gross, A. "Some Effects of Guilt on Compliance," Journal of Personality
and Social Psychology, 11 (1969), 232-239.
Jones, E., et al. "Pattern of Performance and Ability Attribution: An Unexpected Primacy
Effect," Journal of Personality and Social Psychology, 10 (1968), 317-340.
Read one of these studies. (The Jones study is long, involved, and difficult, but well worth the
effort. It has replication and systematic exploration of alternative hypotheses, as well as high theo-
retical and technical competence.)
PART SEVEN
TYPES OF RESEARCH
Chapter 22
Nonexperimental
Research
Among prevalent fallacies, one of the most dangerous to science is that known as post
hoc, ergo propter hoc: after this, therefore caused by this. We may joke, with a tinge of
seriousness, "If I take an umbrella, it won't rain." We may even seriously say that
delinquents are delinquent because of a lack of discipline in the schools or that religious
education makes children more virtuous. It is easy to assume that one thing causes another
simply because it occurs before the other, and because one has such a wide choice of
possible "causes." Then, too, many explanations often seem plausible. It is easy to
believe, for instance, that the learning of children improves because we institute a new
educational practice or teach in a certain way. We assume that theimprovement in their
learning was due to the new spelling method, to the institution of group processes into the
classroom situation, to stem discipline and more homework (or little discipline and less
homework). We rarely realize that children will usually learn something if they are given
the opportunity to learn.
The social scientist and the educational scientist constantly face the problem of the
post hoc fallacy. The sociologist who seeks the causes of delinquency knows that extreme
care must be used in studying the problem. Slum conditions, broken homes, lack of
love — each or all of these conditions are possible causes of delinquency. The psychologist
seeking the roots of adult personality faces an even subtler problem: hereditary traits,
early experiences, and environmental influences are all possible causes.
The danger of the post hoc assumption is that it can, and often does, lead to erroneous
and misleading interpretations of research data, the effect being particularly serious when
scientists have little or no control over time and independent variables. When they seek to
explain a phenomenon that has already occurred, they are confronted with the unpleasant
fact that they do not have real control of the possible causes. Hence they must pursue a
course of research action different in execution and interpretation from that of scientists
who experiment.
Definition
Nonexperimental research is systematic empirical inquiry in which the scientist does not
have direct control of independent variables because their manifestations have already oc-
curred or because they are inherently not manipulable. Inferences about relations among varia-
bles are made, without direct intervention, from concomitant variation of independent and
dependent variables.
The basic logic is set forth in: F. Kerlinger, "Research in Education." In R. Ebel, V. Noll, and R. Bauer,
eds., Encyclopedia of Educational Research, 4th ed. New York: Macmillan, 1969, pp. 1127-1144 (pp. 1133-
1134).
Because of the lack of control of x and other possible x's, the "truth" of the hypothesized relation between x and y cannot
be asserted with the confidence of the experimental situation. Basically, nonexperimental
research has, so to speak, an inherent weakness: lack of control of independent variables.
The most important difference between experimental research and nonexperimental
research, then, is control. In experiments, investigators at least have manipulative control:
they have at least one active variable. If an experiment is a "true" experiment, they can
also exercise control by randomization. They can assign subjects to groups at random, or
can assign treatments to groups at random. In the nonexperimental research situation, this
kind of control of the independent variables is not possible. Investigators must take things
as they are and try to disentangle them.
Take a well-known case. When we paint the skins of rats with carcinogenic substances
(x), adequately control other variables, and the rats ultimately develop carcinoma (y), the
argument is compelling because x (and other possible x's, theoretically) is controlled and y
is predicted. But when we find cases of lung cancer (y) and then go back among the
possible multiplicity of causes (x1, x2, . . . , xn) and pick cigarette-smoking (say x3) as the
culprit, we are in a more difficult and ambiguous situation. Neither situation is sure, of
course; both are probabilistic. But in the experimental case we can be more sure —
considerably more sure if we have adequately made "other things equal" — that the state-
ment If x, then y is empirically valid. In the nonexperimental case, however, we are
always on shakier ground because we cannot say, with nearly as much assurance, "other
things equal." We cannot control the independent variables by manipulation or by ran-
domization. In short, the probability that x is "really" related to y is greater in the
experimental situation than it is in the nonexperimental situation, because the control of x
is greater.
Consider the early studies of smoking and lung cancer. Investigators studied men who had
lung cancer — or who had died of it — and those who did not have it. The dependent
variable was thus the presence or absence of cancer. Investigators probed the subjects'
backgrounds to determine whether they smoked cigarettes, and if so, how many. Ciga-
rette-smoking was the independent variable. The investigators found that the incidence of
lung cancer rose with the number of cigarettes smoked daily. They also found that the
incidence was lower in the cases of light smokers and nonsmokers. They came to the
conclusion that cigarette-smoking causes lung cancer. This conclusion may or may not
be true. But the investigators cannot come to this conclusion, although they can say that
there is a statistically significant relation between the variables.
The reason they cannot state a causal connection is that there are a number of other
variables, any one of which, or any combination of which, may have caused lung cancer.
And they have not controlled other possible independent variables. They cannot control
them, except by testing alternative hypotheses, a procedure to be explained later. Even
when they also study "control groups" of people who have no cancer, self-selection may
be operating. Maybe tense, anxious men are doomed to have lung cancer if they marry tall
women, for instance. It may just happen that this type of man also smokes cigarettes
heavily. The cigarette-smoking is not what kills him — he kills himself by being born
tense and anxious — and possibly by marrying a tall woman. Such men are selected into
the sample by investigators only because they smoke cigarettes. But such men select
themselves into the sample because they commonly possess a temperament that happens
to have cigarette-smoking as a concomitant.
Self-selection can be a subtle business. There are two kinds: self-selection into sam-
ples and into comparison groups. The latter occurs when subjects are selected because
they are in one group or another; cancer and no cancer, college and no college, undera-
chievement and no underachievement. That is, they are selected because they possess the
dependent variable in greater or lesser degree. Self-selection into samples occurs when
subjects are selected in a nonrandom fashion into a sample.
The crux of the matter is that when assignment is not random, there is always a
loophole for other variables to crawl through. When we put subjects into groups, in the
above case and in similar cases, or they "put themselves" into groups, on the basis of one
variable, it is possible that another variable (or variables) correlated with this variable is
the "real" basis of the relation. The usual nonexperimental study uses groups that exhibit
differences in the dependent variable. In some longitudinal-type studies the groups are
differentiated first on the basis of the independent variable. But the two cases are basically
the same, since group membership on the basis of a variable always brings selection into
the picture.
For example, we may select college freshmen at random and then follow them to
determine the relation between intelligence and success in college. The students selected
themselves into college, so to speak. One or more of the characteristics they bring with
them to college, other than intelligence —
socioeconomic level, motivation, family back-
ground —
may be the principal determinants of college success. That we start with the
independent variable, in this case intelligence, does not change the self-selective nature of
the research situation. In the sampling sense, the students selected themselves into col-
lege, which would be an important factor if we were studying college students and non-
college students. But if we are interested only in the success and nonsuccess of college
students, the self-selection into college is not the central difficulty.

". . . the weight of evidence at present implicates smoking as the principal etiological [causative] factor in the
increased incidence of lung cancer."
One of the most important and influential studies of the century was the set of investiga-
tions into ethnocentrism and authoritarianism reported in the book The Authoritarian
Personality. The general hypothesis of the study was that political, economic, and social
beliefs are related to deep-seated personality characteristics. Another hypothesis was that
adult personality is derived from early childhood experiences. In short, attitudes and
beliefs were related to underlying personality trends. The investigators, among other
things, studied anti-Semitism as part of a general characteristic called ethnocentrism.
Later, the investigators extended their thought and work to a still larger construct, authori-
tarianism, which they conceived to be a broad personality syndrome that determines in
part ethnocentrism, social attitudes, and certain other behaviors. The authoritarian person-
ality was conceived to be conventional, cynical, destructive, aggressive, power-centered,
and ethnocentric.
While this is an inadequate summary of the basic problems of a complex study, it is
sufficient for the present purpose. The study had to be nonexperimental because authori-
tarianism and ethnocentrism, as defined, are nonmanipulable variables, as are most of the
variables related to authoritarianism. One of the major results of the study, for instance,
T. Adorno, E. Frenkel-Brunswik, D. Levinson, and R. Sanford, The Authoritarian Personality. New
York: Harper & Row, 1950.
For evidence that the authors' theory about the characteristics of the authoritarian personality and its
measurement was in general well-conceived, see F. Kerlinger and M. Rokeach, "The Factorial Nature of the F
and D Scales," Journal of Personality and Social Psychology, 4 (1966), 391-399.
A large preoccupation of educational researchers has been a search for the determinants of
school achievement. What are the factors that lead to successful achievement in school
and unsuccessful achievement? Intelligence is an important factor, of course. While
measured intelligence, especially verbal ability, accounts for a large proportion of the
variance of achievement, there are many other variables, psychological and sociological:
sex, race, social class, aptitude, environmental characteristics, school and teacher charac-
teristics, family background, teaching methods. The study of achievement is character-
ized by both experimental and nonexperimental approaches. We are here concerned only
with the latter since it clearly illustrates problems of nonexperimental research.
In 1966 the now famous Coleman report was published. As its title indicates, it was a
large-scale attempt to answer the question: Do American schools offer equal educational
opportunity to all children? Equally important, however, was the question of the relation
between student achievement and the kinds of schools students attend. This study was a
massive and admirable effort to answer these questions (and others). Its most famous and
controversial finding was that the differences among schools account for only a small
fraction of the differences in school achievement. Most achievement variance was ac-
counted for by what the children bring with them to school. There was much to question
about the study's methodology and conclusions. Indeed, its reverberations are still with
us. The principal dependent variable was verbal achievement. There were, however, more
than 100 independent variables. The authors used relatively sophisticated multivariate
procedures to analyze the data. Much of the core of the analytic problems, the interpreta-
tions of the findings, and the subsequent critiques inhere in the nonexperimental nature of
the research.
The controversial conclusion mentioned above of the relative importance of home
background variables and school variables depends on a completely reliable and valid
method for assessing relative impacts of different variables. In experimental research, one
is safer drawing comparative conclusions because the independent variables are not corre-
The literature is large. A good guide to some of the work is: B. Bloom, Stability and Change in Human
Characteristics. New York: Wiley, 1964, especially chap. 4. Another valuable book is: H. Hyman, C. Wright,
and J. Reed, The Enduring Effects of Education. Chicago: University of Chicago Press, 1975. An impressive
work that focuses mainly on teaching and the observation of teaching is: M. Dunkin and B. Biddle, The Study of
Teaching. New York: Holt, Rinehart and Winston, 1974.
J. Coleman, E. Campbell, C. Hobson, J. McPartland, A. Mood, F. Weinfeld, and R. York, Equality of
Educational Opportunity. Washington, D.C.: U.S. Govt. Printing Office, 1966.
An important evaluation of Equality is: F. Mosteller and D. Moynihan, eds., On Equality of Educational
Opportunity. New York: Vintage Books, 1972. Another important critique and analysis is: C. Jencks and others,
Inequality: A Reassessment of the Effect of Family and Schooling in America. New York: Basic Books, 1972
(paperback: Harper Colophon Books, 1973).
lated. In the real educational world, however, the variables are correlated, making their
unique contributions to achievement hard to determine. While there are statistical methods
to handle such problems, no methods can tell us unambiguously that X1 influences Y to
this or that extent, because the real influence may be X2, which influences both X1 and Y.
Such a completely dependable method is probably unattainable. While there are powerful
analytic methods to use with nonexperimental data, unequivocal answers to questions of
the determinants of achievement are forever beyond reach.
One can never be sure, moreover, that variables other than those included in the research
are not the "real" source of the variance of the dependent variable. I deliberately selected
what I thought was a major study with excellent methodology, Verba and Nie's study of
political participation, to underline the difficulties of nonexperimental research. For the
important relation found between citizen participation and political leader responsiveness,
for example, might the substantial relation be due not to citizen participation as such but,
say, to the (presumed) fact that citizens who participate more are also upper social status
people and so are the political leaders? That is, the leaders are responsive not so much
because citizens participate at a high level but because higher social status citizens partici-
pate more than lower status citizens, and leaders respond to the social status — and its
accompanying education, influence, and attitudes — rather than the participation as such.
The participation, in other words, is a variable that "helps" make social status visible to
leaders. Another reservation, as the authors themselves bring out, is that the concurrence
of leaders and citizens was not measured in urban cores. Is it possible that in such cores
the relation is negligible?
"S. Verba and N. Nie, Participation in America: Political Democracy and Social Equality. New York:
Harper & Row, 1972.
""See ibid., pp. 336-337, especially Table 20-1, where the above argument actually turns out to be in part
correct.
The studies examined here were chosen for three reasons. One, they each represent a unique, original, and interesting approach to an important sociological, psychological, or educational problem. Two, each contributes significantly to scientific knowledge. And three, each is nonexperimental.
Again, this example is really a set of researches all directed to the same general question, which can be loosely expressed: What effects do teachers' classroom management procedures have on children in classrooms?¹⁰ The research of Kounin and his colleagues has been characterized by original and significant variables, both independent and dependent, by systems of extensive observations of the behavior of teachers and pupils in classrooms specifically aimed at measuring the variables, and by careful operationalizations of the variables. In one such study, for example, Kounin and Doyle analyzed videotapes of 596 formal lessons taught by 36 teachers in a preschool, with the aim of measuring the variables lesson continuity and task involvement. The hypothesis tested was simple: the more continuous and unlagging a lesson, the greater the task involvement of children.
During lessons, observers coded children's behavior for appropriate involvement every six seconds, and percentages of involvement scores were calculated. Continuity of lessons was measured by noting and timing child recitations, which were thought to be more discontinuous than the "official" teacher reading and demonstrating. In other words, if a lesson had a high proportion of child recitations, it was considered discontinuous. High task-involvement and low task-involvement reading lessons, as categorized by observers, were compared using the percentages of child-recitation times. The mean percentage of child recitation for high task-involvement lessons was 8.40; the mean for low task-involvement lessons was 20.20. The difference was statistically significant. A similar analysis of demonstration lessons yielded similar results. The authors concluded: "measured degrees of continuity within lesson types distinguished between those lessons that had high task involvement when manned by the same occupants" (p. 163).
This is interesting and potentially important research, and all the Kounin studies are characterized by an imaginative yet objective approach to teacher and pupil observation. Unfortunately, none of the studies is experimental. How much more convincing the relation between lesson continuity and task involvement would have been, for example, if lesson continuity had been experimentally manipulated and task involvement had been substantially affected thereby! Questions can and should also be raised about the selection of lessons for analysis and about the analysis of the data. (It seems, again, that independent variable measures were analyzed rather than, as one would normally expect, measures of the dependent variable.¹¹ This possible error, incidentally, could probably not have happened if the research had been experimental, since there would have been k experimental groups in which different amounts of continuity [or discontinuity] had been engendered, and the analysis would perforce have been of the task-involvement scores and means of these groups.) Despite
¹⁰J. Kounin, Discipline and Group Management in Classrooms. New York: Holt, Rinehart and Winston, 1970. Other interesting studies have been published since this book's appearance. The one chosen to be summarized is: J. Kounin and P. Doyle, "Degree of Continuity of a Lesson's Signal System and the Task Involvement of Children," Journal of Educational Psychology, 67 (1975), 159-164.
"The means of 18.8 for the high task-involvement group and
authors report in their Table 4. for instance,
15.1 for the low task-involvement group, not significantly different. But these are means of child recitation
scores and nol task-involvement scores. 1 did what amounts to an analysis of group membership scores, mem-
bership in the high and low task-involvement groups, predicting such membership from the child-recitation
scores. I did similar analyses for the data of their four tables. In all cases, the conclusions were the same as
theirs. It is possible, however, that they might not have been the same. (Later I will show how such analyses are
done.)
these methodological caveats, which are troublesome because they becloud the substantive issues, the Kounin et al. studies, although not experimental even when they could be, are creative empirical approaches to long-standing teaching problems and to the methods of classroom discipline and management and the controlled observation of teaching variables in the classroom. Moreover, a theory of classroom management seems possible to develop on the basis of the empirical work.
There has been much thought and speculation on the relations among capitalism, Protestantism, and achievement. Max Weber, for example, wrote an important book on capitalism and Protestantism.¹² His basic hypothesis was that Protestantism led to the spirit of capitalism because the Protestant ethic (self-reliance, deferment of enjoyment, asceticism, emphasis on achievement, and so on) produced individuals with the qualities necessary to capitalistic enterprise and development. McClelland, in a remarkable book, has described his many studies on the relation between Protestantism and capitalism.¹³ Actually, McClelland's main interest has been motivation and its measurement, and his research can safely be called one of the successes of psychology. The variable of his principal interest and work has been achievement motivation, commonly called n Achievement, or n Ach, which he has measured by asking individuals to write brief stories suggested by pictures shown them for a few seconds, the pictures representing situations related to work. The stories were then content-analyzed using a complex coding system to obtain scores of n Achievement for each individual. In the present study n Achievement was an independent variable among several independent variables used to predict the economic growth of nations.
McClelland's hypothesis was that countries whose population is predominantly Prot-
estant will emphasize achievement. Protestant countries should thus show greater capital-
istic enterprise than Catholic countries, other things equal. "Capitalistic enterprise" is
¹²M. Weber, The Protestant Ethic and the Spirit of Capitalism (transl. T. Parsons). New York: Scribner, 1930 (1904).
¹³D. McClelland, The Achieving Society. Princeton: Van Nostrand, 1961.
There are difficulties, then, in drawing valid conclusions from the evidence presented. On the other hand, McClelland's work greatly advanced scientific understanding of the presumed determinants of capitalistic enterprise and growth and of achievement motivation, even though one wishes for more definitive evidence.¹⁴
What are the conditions and determinants of political democracy? Sociologists and political scientists, long interested in this and related questions and perhaps stimulated by advances in analytic approaches, have recently directed empirical inquiries to the problem of political democracy. We now examine a good example of such research, again to illustrate nonexperimental inquiry and the testing of alternative hypotheses.¹⁵ The study has considerable intrinsic interest; it can also help build our sensitivity to and knowledge of multivariate approaches to complex problems. Other points of interest are the unit of analysis used, countries, and the skillful measurement of complex concepts, particularly political democracy. Bollen directed his study mainly to the question of which is the more important factor in the development of political democracy: timing of development or level of development. That is, are late-developing countries less likely than early-developing countries to have attained political democracy? Plausible arguments have been advanced to answer the question. Bollen sought an empirical answer. In addition, he studied the ideas that Protestantism leads to higher political development, a nice link to McClelland's research, and that the greater the state's control of the economic system, the lower the level of democracy.
Using a sample of 99 countries with widely different levels of development, Bollen assessed the effects of timing and level of development on political democracy. To do this, he used multiple regression analysis, which has been mentioned before and will be discussed in detail later in the book. The results of the analysis supported the hypotheses, with one or two exceptions. The most important variable affecting political democracy was level of development and not timing of development: the greater the development of a country, as measured by its energy consumption, the greater its political democracy. That is, how much development there was mattered far more than when the development took place. The proportion of the population that was Protestant and the state's control of the economy were also significant influences, the former positive and the latter negative.
With these nonexperimental studies behind us, we can discuss and evaluate nonexperimental research in general. We preface evaluative discussion, however, with a more systematic inquiry into the testing of alternative hypotheses, one of the highly important features of scientific research.
'""My appraisal of McClelland's thinking and work is perhaps too restrained. Like most good scientists,
McClelland's mind has roamed over a wide assortment of human activities, and attempted to explain them with
the Protestant ethic and n Achievement. He even speculates about preferences for colors and for travel and —
then tests the speculations, or cites the work of others in testing them! (See. e.g., ibid., pp. 309ff. ) 1 should have
said that the book is an imaginative classic!
'^K. Bollen. "Political Democracv and the Timing of Development." Ameriian Sociological Reyie»-. 44
(1979), 572-587.
As indicated in earlier chapters, we can also "confirm" and "disconfirm" hypotheses under study by trying to show that alternative plausible hypotheses are or are not supported. First consider alternative independent variables as antecedents of a dependent variable. The reasoning is the same. If we say "alternative independent variables," for example, we are in effect stating alternative hypotheses or explanations of a dependent variable.
In nonexperimental studies, although one cannot have the confidence in the "truth" of an "If x, then y" statement that one can have in experiments, it is possible to set up and test alternative or "control" hypotheses. (Of course, alternative hypotheses can be and are tested in experimental studies, too. This procedure has been formalized and explained.¹⁶) Suppose the alternative hypotheses are tested and not substantiated, while the evidence continues to point to the influential independent variable. Since the alternative or "control" hypotheses have not been substantiated, the original hypothesis is strengthened.
Similarly, we can test alternative dependent variables, which also imply alternative hypotheses.
The Alper, Blane, and Abrams study of children's reactions to finger paints¹⁸ was nonexperimental because it was not possible to manipulate the independent variable and because the children came to the study with their reactions ready-made, as it were.
""J. Plait. --Strong Inference."" Science. 146 (1964), 347-353; T. Chamberlin, --The Method of Multiple
Working Hypotheses." Science. 147 (1965). 754-759. The Chamberlin article was ongmally published in
Science in 1890 (vol. 15). A clear explanation of the logic behind testing alternative hypotheses is given in: M.
Cohen and E. Nagel. An Introduction to Logic and Scientific Method. New York: Harcourt Brace Jovanovich,
1934, pp. 265-267.
"Chamberlin. op. cit.. p. 756.
'*T. Alper. H. Blane. and B. Abrams. "Reactions of Middle and Lower Class Children to Finger Paints as
a Function of Class Differences in Child-Training Practices." Journal of Abnormal and Social Psychology. 5
(1955). 439-448.
This use of a control study was ingenious and crucial. Imagine the researchers' consternation if the differences between the two groups on the crayon task had been significant!
Now consider a study by Sarnoff et al. in which it was predicted that English and American children would differ significantly in test anxiety but not in general anxiety.¹⁹ The hypothesis was carefully delineated: If eleven-plus examinations are taken, then test anxiety results. (The eleven-plus examinations are given to English school children at eleven years of age to help determine their educational futures.) Since it was possible that there might be other independent variables causing the difference between the English and American children in test anxiety, the investigators evidently wished to rule out at least some of the major contenders. This they accomplished in part by carefully matching the samples. They probably also reasoned that the difference in test anxiety might be due to a difference in general anxiety, since the measure of test anxiety obviously must reflect some general anxiety. If this were found to be so, the major hypothesis would not be supported. Therefore Sarnoff and his colleagues, in addition to testing the relation between examination and test anxiety, also tested the relation between examination and general anxiety.
The method of testing alternative hypotheses, though important in all research, is particularly important in nonexperimental studies, because it is one of the only ways to "control" the independent variables of such research. Lacking the possibility of randomization and manipulation, nonexperimental researchers, perhaps more so than experimentalists, must be very sensitive to alternative hypothesis-testing possibilities.
Nonexperimental research has three major weaknesses, two of which have already been
discussed in detail: (1) the inability to manipulate independent variables, (2) the lack of
power to randomize, and (3) the risk of improper interpretation. In other words, compared
to experimental research, other things equal, nonexperimental research lacks control; this
lack is the basis of the third weakness: the risk of improper interpretation.
The danger of improper and erroneous interpretations in nonexperimental research
stems in part from the plausibility of many explanations of complex events. It is easy to
accept the first and most obvious interpretation of an established relation, especially if one
works without hypotheses to guide investigation. Research unguided by hypotheses, re-
search "to find out things," is most often nonexperimental. Experimental research is
more likely to be based on carefully stated hypotheses.
Hypotheses are if-then predictions. In a research experiment the prediction is from a well-controlled x to a y. If the prediction holds true, we are relatively safe in stating the
'"I. Samoff, F. Lighthall. R. Waite, K. Davidson, and S. Sarason, "A Cross-Cultural Study of Anxiety
Among American and English School Children,"' Journal of Educational Psychology. 49 (1958), 129-136.
conditional, If x, then y. In a nonexperimental study under the same conditions, however, we are considerably less safe in stating the conditional, for reasons discussed earlier. Careful safeguards are more essential in the latter case, especially in the selection and testing of alternative hypotheses, such as the predicted lack of relation between the eleven-plus examination and general anxiety in the Sarnoff study. A predicted (or unpredicted) relation in nonexperimental research may be quite spurious, but its plausibility and conformity to preconception may make it easy to accept. This is a danger in experimental research, too, but it is less of a danger there because an experimental situation is so much easier to control.
Nonexperimental research that is conducted without hypotheses, without predictions, research in which data are just collected and then interpreted, is even more dangerous in its power to mislead. Significant differences or correlations are located if possible and then interpreted. Assume that an educator decides to study the factors leading to underachievement. He selects a group of underachievers and a group of normal achievers and administers a battery of tests to both groups. He then calculates the means of the two groups on the tests and analyzes the differences with t tests. Among, say, twelve such differences, three are significant. The investigator concludes, then, that underachievers and normal achievers differ on the variables measured by these three tests. Upon analysis of the three tests, he thinks he understands what characterizes underachievers. Since all three of the tests seem to measure insecurity, the cause of underachievement is therefore insecurity.
When guided by hypotheses, the credibility of the results of studies like the one just cited may be enhanced, but the results are still weak because they capitalize on chance: by chance alone one or two results of many statistical tests may be significant. Above all, plausibility can be misleading. A plausible explanation often seems compelling, even though quite wrong! It seems so obvious, for example, that conservatism and liberalism are opposites. The research evidence, however, seems to indicate that they are not opposites.²⁰ Another difficulty is that plausible explanations, once found and believed, are often hard to test. According to Merton, post factum explanations do not lend themselves to nullifiability because they are so flexible. Whatever the observations, he says, new interpretations can be found to "fit the facts."²¹
Many significant variables of behavioral research, rigidity and ethnocentrism among them, are not manipulable. Controlled inquiry is still possible, however. It can even be said that nonexperimental research is more important than experimental research. This is, of course, not a methodological observation. It means, rather, that most social scientific and educational research problems do not lend themselves to experimentation, although many of them do lend themselves to controlled inquiry of the nonexperimental kind. Consider Piaget's studies of children's thinking, the authoritarianism studies
²⁰See F. Kerlinger, "Social Attitudes and Their Criterial Referents: A Structural Theory," Psychological Review, 74 (1967), 110-122; "Analysis of Covariance Structure Tests of a Criterial Referents Theory of Attitudes," Multivariate Behavioral Research, 15 (1980), 403-422; Liberalism and Conservatism: The Nature and Structure of Social Attitudes. Hillsdale, N.J.: Erlbaum, 1984.
of Adorno et al., the highly important study Equality of Educational Opportunity, and McClelland's studies of need for achievement. If a tally of sound and important studies in the behavioral sciences and education were made, it is possible that nonexperimental studies would outnumber and outrank experimental studies.
CONCLUSIONS
Students of research differ widely in their views of the relative values of experimental and nonexperimental research. There are those who exalt experimental research and decry nonexperimental research. There are those who criticize the alleged narrowness and lack of "reality" of experiments, especially laboratory experiments. These critics, especially in education, emphasize the value and relevance of nonexperimental research in "real-life," "natural" situations. A rational position seems obvious. If it is possible, use experimentation because, other things equal, one can interpret the results of most experiments with greater confidence that statements of the If p, then q kind are what we say they are. If experimentation is not possible, use nonexperimental approaches. The research of Kounin and his colleagues on the influence of teacher variables is impressive and convincing. But how much more impressive and convincing it would be if similar conclusions arose from well-conducted experiments! Conversely, how much more convincing experimental conclusions are if substantiated in well-conducted nonexperimental research.
Replication is always desirable, even necessary. An important point being made is that replication of research does not only mean repetition of the same studies in the same settings. It can and should mean testing empirical implications of theory, interpreting "theory" broadly, in similar and dissimilar situations, and experimentally and nonexperimentally. It is easier to ask for extensions of research from the laboratory to the field. But there is no one methodological road to scientific validity; there are many roads. And we should choose our roads for their appropriateness to the problems we study. This does not mean, however, that we cannot exploit an approach that is different from what we are used to.
For some strange reason, perhaps the spurious belief in the alleged certitude of science, when people, including scientists, think of science and scientific research, they mistakenly believe there is only one "right" way to approach and do research. Rarely is such a mistake made in music, or art, or building a house. Science, too, has many roads, and experimental and nonexperimental approaches are two such broad roads. Neither is right or wrong. But they are different. Our task has been to try to understand the differences and their consequences. We are far from finished with the subject, however. Maybe we will even attain a fair degree of understanding before we are through.
ADDENDUM
Causality and Scientific Research
The study and analysis of "causal relations" in research has recently preoccupied social scientists. Economists and econometricians seem to have been the leaders in this work.²² Simon, a psychologist, has also been a pioneer.²³ Sociologists have written extensively on the subject.²⁴ The analytic and conceptual movement is productive, as we will see in a later chapter. It is perhaps unfortunate that the words "cause" and "causal relations" have been used. They imply that science can find the causes of phenomena.
The position taken in this book is that the study of cause and causation is an endless maze. One of the difficulties is that the word "cause" has surplus meaning and metaphysical overtones. Perhaps more important, it is not really needed. Scientific research can be done without invoking cause and causal explanations, even though the words and other words that imply cause are almost impossible to avoid and thus will occasionally be used.
Blalock points out that causal laws cannot be demonstrated empirically, but that it is helpful to think causally.²⁵ I agree that causal laws cannot be demonstrated empirically, but am equivocal about thinking causally. There is little doubt that scientists do think causally and that when they talk of a relation between p and q they hope or believe that p causes q. But no amount of evidence can demonstrate that p does cause q.
This position is not so much an objection to causal notions as it is an affirmation that they are not necessary to scientific work. Evidence can be brought to bear on the empirical validity of conditional statements of the If p, then q kind, alternative hypotheses can be tested, and probabilistic statements can be made about p and q, and other p's and q's, and conditions r, s, and t. Invocation of the word "cause" and the expression "causal relation" does nothing really constructive. Indeed, it can be misleading.
In expert hands and used with circumspection, path analysis and related methods can help to clarify theoretical and empirical relations.²⁶ But when their espousal and use imply that causes are sought and found, such methods can also be misleading. In sum, the elements of deductive logic in relation to conditional statements, a probabilistic framework and method of work and inference, and the testing of alternative hypotheses are sufficient aids to scientific nonexperimental work without the excess baggage of causal notions and methods presumably geared to strengthening causal inferences. We rest the case with Russell's well-known statement:
the word "cause" is so inextricably bound up with misleading associations as to make its complete extrusion from the philosophical vocabulary desirable ... the reason physics has ceased to look for causes is that, in fact, there are no such things. The law of causality ... is a relic of a bygone age, surviving, like the monarchy, only because it is erroneously supposed to do no harm.
"See H. Wold and L. Jureen. Demand Analysis. New York: Wiley, 1953.
"H. Simon. Models of Men. New York: Wiley, 1957.
-'E.g., H. Blalock, Causal Inferences in Nonexperimental Research. Chapel Hill, N.C.: University of
North Carolina Press, 1961.
-^Ibid.. p. 6.
^*For an extended discussion of the value and use of path analysis and so-called commonality
analysis in
Study Suggestions
(b) Whatever judgment you have made, can you justifiably reverse the variables?
(c) Do you think a research project designed to investigate this problem would be basically experimental or nonexperimental?
(d) Can the investigator do two researches, one experimental and one nonexperimental, both designed to test the same hypothesis?
(e) If your answer to (d) was Yes, will the variables of the two problems be the same? Assuming that the relations in both researches were significant, will the conclusions be substantially the same?
4. Suppose that you want to study the effects of the decisions of boards of education on various
aspects of education, such as teacher morale, pupil achievement, relations between teachers and
administrators, teacher clique formation. Would your research be experimental or nonexperimental?
Why?
5. In the study suggestions of Chapter 2, a number of problems and hypotheses were given.
Take each of these problems and hypotheses and decide whether research designed to explore the
problems and test the hypotheses would be basically experimental or nonexperimental. Can any of
the problems and hypotheses be tackled in both ways?
6. McClelland, in the study described in this chapter, presents data on the electrical production during 1952-1958 of countries high in n Achievement and low in n Achievement. Counting the number of countries in each of the four cells, we obtain the results shown in Table 22.1.
[Table 22.1, a fourfold table classifying countries by high and low n Achievement and by gain in electrical production, is garbled in this copy; only the fragment ".65" and the row label "Low n Ach" survive.]
in education is mixed: experimental and nonexperimental. Political science is also mixed. Anthropology is mostly nonexperimental. (Think, too, of economics, astronomy, physics, chemistry, and biology.) Why are some sciences predominantly experimental and others nonexperimental? Explain specifically what you mean.
8. The venturesome student may wish to take a plunge into stimulating, provocative, controversial, and important thinking. The famous Club of Rome report has outraged some observers, startled almost anyone who has read it, and disturbed everyone. Using societally important variables (natural resources, pollution, and population, for example) and their complex interactions, ultimate disaster to cities and the world has been predicted. The research on which the conclusions are based is entirely nonexperimental. Try reading the report and perhaps one of the works of the pioneer of the area of study, Professor Forrester of MIT. Do you think that the research's nonexperimental character lowers its credibility?
D. H. Meadows, D. L. Meadows, J. Randers, and W. Behrens, The Limits to Growth, 2d ed. New York: Universe Books, 1974.
Laboratory Experiments,
Field Experiments,
and Field Studies
Social scientific research can be divided into four major categories: laboratory experi-
ments, field experiments, field studies, and survey research.¹ This breakdown stems from
two sources, the distinction between experimental and nonexperimental research and that
between laboratory and field research.
¹This chapter owes much to L. Festinger and D. Katz, eds., Research Methods in the Behavioral Sciences. New York: Holt, Rinehart and Winston, 1953, chaps. 2, 3, 4. Although thirty years old, this book is still a valuable source on many aspects of behavioral research methodology.
²N. Miller, "Learning of Visceral and Glandular Responses," Science, 163 (1969), 434-445; N. Miller, Selected Papers. Chicago: Aldine, 1971, Part XI.
"taught."' Miller and his colleagues' work has shown that, through instrumental condi-
tioning, the heart rate can be changed,stomach contractions can be altered, and even urine
formation can be increased or decreased! This discovery is of enormous theoretical and
practical importance. To show the nature of laboratory experiments, we take one of
Miller's interesting and creative experiments.
The idea of the experiment is simple; reward one group of rats when their heart rates
go up, and reward another group when their heart rates go down. This is a straightforward
example of the two-group design discussed earlier. Miller's big problem was control.
There are a number of other causes of changed heart rate —
for example, muscular exer-
tion. To control such extraneous variables. Miller and a colleague (Trowill) paralyzed the
rats with curare. But if the rats were paralyzed, what could be used as reward? They
decided to use direct electrical stimulation of the brain. The dependent variable, heart
rate, was continuously recorded with the electrocardiograph. When a small change in
heart rate occurred (in the "right" way: up for one group, down for the other), an animal
was given an electrical impulse to a reward center of its brain.'* This was continued until
the animals were "trained."
The increases and decreases of heart rate were statistically reliable but small: only five
percent in each direction. So Miller and another colleague (DiCara) used the technique
known as shaping, which, in this case, means rewarding first small changes and then
requiring increasing changes in rate to obtain the rewards. This increased the heart rate
changes to an average of 20 percent in either direction. Moreover, further research, using
escape from mild shock as reinforcement, showed that the animals remembered what they
had learned and "differentiated" the heart responses from other responses.
Miller has been successful in "training" a number of other involuntary responses: intestinal contraction, urine formation, and blood pressure, for example. In short, visceral responses can be learned and can be shaped. But can the method be used with people? Miller says that he thinks people are as smart as rats, but that it has not yet been completely proved. Although the use of curare might present difficulty, people can be hypnotized, says Miller.
³To understand Miller's studies, we must define certain psychological terms. In classical conditioning a neutral stimulus, inherently unable to produce a certain response, becomes able to do so by being associated repeatedly with a stimulus inherently capable of producing the response. The most famous example is Pavlov's dog salivating at the clicking of a metronome, which had been repeatedly associated with meat powder. In instrumental conditioning, a reinforcement given to an organism immediately after it has made a response produces an increment in the response. Pigeons, for example, will peck their beaks bloody after having been subjected to certain forms of instrumental conditioning. In short, reward a response and it will be repeated. Voluntary responses or behavior are thought to be superior, presumably because they are under the control of the individual, whereas involuntary responses are inferior because they are not. It has been believed that involuntary responses can be modified only by classical conditioning and not by instrumental conditioning. In other words, the possibility of "teaching" the heart, the stomach, and the blood is remote, since classical conditioning conditions are difficult to come by. If the organs are subject to instrumental conditioning, however, they can be brought under experimental control, they can be "taught," and they can "learn." For authoritative accounts of both kinds of conditioning and their relation to learning, see E. Hilgard and G. Bower, Theories of Learning, 4th ed. Englewood Cliffs, N.J.: Prentice-Hall, 1975.
This study was discussed earlier in connection with factorial analysis of variance, hence it is not necessary to labor all the study details here.⁵ Recall that the authors randomly selected 240 colleges from a college guide and sent the colleges application letters from fictitious individuals. In the letters, they manipulated applicants' race, sex, and ability levels (three such levels). For example, a candidate might be a black female of high ability, or a white male of medium ability. This is, of course, a 2 x 2 x 3 factorial design. The letter to each college was from a "candidate" who represented one cell of the 12 cells of the design. The dependent variable measure was obtained by quantifying the degrees of the colleges' acceptances of candidates. This amounted to a five-point scale: (1) rejection; (2) rejection, but qualified; . . .; (5) acceptance, with encouragement.
Race and sex were found not to be significant. From these results alone one might conclude that there was no bias in admissions. But recall that the interaction of sex and ability was significant. Although there were no differences in admissions at the high and medium levels of ability, the mean acceptances of men and women at the low ability level were significantly different. The male mean was 3.00, and the female mean was 1.93. (See Table 14.14.) So evidently there was discrimination: women were discriminated against at the low ability level!
⁵E. Walster, T. Cleary, and M. Clifford, "The Effect of Race and Sex on College Admissions," Journal of Educational Sociology, 44 (1970), 237-244.
⁶T. Newcomb, Personality and Social Change. New York: Holt, Rinehart and Winston, 1943. A number of the Bennington students were later restudied in follow-up research designed to test the permanence of the changes made by Bennington: T. Newcomb, K. Koenig, R. Flacks, and D. Warwick, Persistence and Change: Bennington College and Its Students After Twenty-Five Years. New York: Wiley, 1967. In general, it was found that the changes had lasted; evidently Bennington's influence was persistent over the years.
Newcomb asked a "control" question: Would these attitudes have changed in other colleges? To answer this question, Newcomb administered his conservatism measures to students of Williams College and Skidmore College. The comparable mean scores of Skidmore students, freshmen through seniors, were: 79.9, 78.1, 77.0, and 74.1. It seems that Skidmore (and Williams) students did not change as much and as consistently over time as did the Bennington students.
This is done by isolating the research in a physical situation apart from the routine of ordinary living and by manipulating one or more independent variables under rigorously specified, operationalized, and controlled conditions.
The laboratory experiment has the inherent virtue of the possibility of relatively complete control. The laboratory experimenter can, and often does, isolate the research situation from the life around the laboratory by eliminating the many extraneous influences that may affect the dependent variable. A related virtue is precision, often achieved with the aid of precision instruments. In variance terms, the more precise an experimental procedure is, the less the error variance. The more accurate or precise a measuring instrument is, the more certain we can be that the measures obtained do not vary much from their "true" values. This is the problem of reliability, which will be discussed in a later chapter.
Precise laboratory results are achieved mainly by controlled manipulation and meas-
urement in an environment from which possible "contaminating" conditions have been
eliminated. Research reports of laboratory experiments usually specify in detail how the
manipulations were done and the means taken to control the environmental conditions
under which they were done. By specifying exactly the conditions of the experiment, we
reduce the risk that subjects may respond equivocally and thus introduce random variance
into the experimental situation. Miller's experiment is a model of laboratory experimental
precision.
The greatest weakness of the laboratory experiment is probably the lack of strength of
independent variables. Since laboratory situations are, after all, situations that are created
for special purposes, it can be said that the effects of experimental manipulations are
usually weak. Increases and decreases in heart rate by electrical brain reinforcement, while striking, were relatively small. Compare this to the relatively large effects of independent variables in realistic situations. In the Bennington study, for example, the college's influence on students' attitudes was pervasive.
Another weakness is a product of the first: the artificiality of the experimental research situation. Actually, it is difficult to know if artificiality is a weakness or simply a neutral
characteristic of laboratory experimental situations. When a research situation is deliber-
ately contrived to exclude the many distractions of the environment, it is perhaps illogical
to label the situation with a term that expresses in part the result being sought. The
criticism of artificiality does not come from experimenters, who know that experimental
situations are artificial; it comes from individuals lacking an understanding of the pur-
poses of laboratory experiments.
The temptation to interpret the results of laboratory experiments incorrectly is great. While Miller's results are believed by social scientists to be highly significant, they can only tentatively be extrapolated beyond the laboratory. Similar results may be obtained in real-life situations, and there is evidence that they are in some cases. But this is not
necessarily so. The relations must always be tested anew under nonlaboratory conditions.
Miller's research, for instance, will have to be carefully and cautiously done with human
beings in hospitals and even in schools.
Although laboratory experiments have relatively high internal validity, then, they lack external validity. Earlier we asked the question: Did X, the experimental manipulation,
really make a significant difference? The stronger our confidence in the "truth" of the
relations discovered in a research study, the greater the internal validity of the study.
When a relation is discovered in a well-executed laboratory experiment, we generally can
have considerable confidence in it, since we have exercised the maximum possible control
of the independent variable and other possible extraneous independent variables. When
Miller "discovered" that visceral responses could be learned and shaped, he could be
relatively sure of the "truth" of the relation between reinforcement and visceral response
in the laboratory. He had achieved a high degree of control and of internal validity.
One can say: If I study this problem using field experiments, maybe I will find the same relation. This is an empirical, not a speculative, matter; we must put the relation to test in the situation to which we wish to generalize. If a researcher finds that individuals converge on group norms in the laboratory, as Sherif did,⁷ does the same or similar phenomenon occur in community groups, faculties, legislative bodies? This lack of external validity is the basis of the objections of many educators to the animal studies of learning theory. Their objections are valid only if an experimenter generalizes from the behavior and learning of laboratory animals to the behavior and learning of children.
⁷M. Sherif, "Formation of Social Norms: The Experimental Paradigm," in H. Proshansky and B. Seidenberg, eds., Basic Studies in Social Psychology. New York: Holt, Rinehart and Winston, 1965, pp. 461-471. This is a classic laboratory experiment with large implications for both theory and practice.
Capable experimentalists, however, rarely blunder in this fashion; they know that the laboratory is a contrived environment.
Laboratory experiments have three related purposes. One, they are a means of studying relations under "pure" and uncontaminated conditions. Experimenters ask: Is x related to y? How is it related to y? How strong is the relation? Under what conditions does the relation change? They seek to write equations of the form y = f(x), make predictions on the basis of the function, and see how well and under what conditions the function performs.
A second purpose should be mentioned in conjunction with the first: the testing of predictions derived from theory, primarily, and other research, secondarily. For instance, on the basis of Sherif's norm-convergence finding, one might predict to a number of other laboratory and field experimental situations, as Sherif did in his later studies of boys in camp situations. Asch, though, argued that Sherif's stimulus was ambiguous in the sense that different people would "interpret" it differently.⁸ He wondered whether the convergence phenomenon would work with clear stimuli in a more realistic setting. A series of experiments showed that it did.
A third purpose of laboratory experiments is to refine theories and hypotheses, to
formulate hypotheses related to other experimentally or nonexperimentally tested hypoth-
eses, and, perhaps most important, to help build theoretical systems. This was one of
Miller's and Sherif's major purposes. Although some laboratory experiments are con-
ducted without this purpose, of course, most laboratory experiments are theory-oriented.
The aim of laboratory experiments, then, is to test hypotheses derived from theory, to study the precise interrelations of variables and their operation, and to control variance under research conditions that are uncontaminated by the operation of extraneous variables. As such, the laboratory experiment is one of the great inventions of all time. Although weaknesses exist, they are weaknesses only in a sense that is really irrelevant. Conceding the lack of representativeness (external validity), the well-done laboratory experiment still has the fundamental prerequisite of any research: internal validity.⁹
A field experiment is a research study in a realistic situation in which one or more independent variables are manipulated by the experimenter under as carefully controlled conditions as the situation will permit. The contrast between the laboratory experiment and the field experiment is not sharp; the differences are mostly matters of degree. Sometimes it is hard to label a particular study "laboratory experiment" or "field experiment." Where the laboratory experiment has a maximum of control, most field experiments must operate with less control, a factor that is often a severe handicap.
⁹In this discussion, guidance on the actual conduct of experiments has been omitted. The reader who wants to go deeper and get practical guidance can profit from a fine and detailed chapter by two social psychologists:
Field experiments have values that especially recommend them to social psychologists, sociologists, and educators because they are admirably suited to many of the social and educational problems of interest to social psychology, sociology, and education. Because independent variables are manipulated and randomization used, the criterion of control can be satisfied, at least theoretically.
The control of the experimental field situation, however, is rarely as tight as that of the
laboratory. We have here both a strength and a weakness. The investigator in a field
experiment, though he has the power of manipulation, is always faced with the unpleasant
possibility that his independent variables are contaminated by uncontrolled environmental
variables. We stress this point because the necessity of controlling extraneous independent
variables is particularly urgent in field experiments. The laboratory experiment is con-
ducted in a tightly controlled situation, whereas the field experiment takes place in a
natural, often loose, situation. One of the main preoccupations of the field experimenter,
then, is to try to make the research situation more closely approximate the conditions of
the laboratory experiment. Of course this is often a difficult goal to reach, but if the
research situation can be kept tight, the field experiment is powerful because one can in
general have greater confidence that relations are indeed what one says they are.
As compensation for dilution of control, the field experiment has two or three unique
virtues. The variables in a field experiment usually have a stronger effect than those of
laboratory experiments. The effects of field experiments are often strong enough to pene-
trate the distractions of experimental situations. The principle is: The more realistic the
research situation, the stronger the variables. This is one advantage of doing research in
educational settings. For the most part, research in school settings is similar to routine
educational activities, and thus need not necessarily be viewed as something special and
apart from school life. Despite the pleas of many educators for more realistic educational
research, there is no special virtue in realism, as realism. Realism simply increases the
strength of the variables. It also contributes to external validity, since the more realistic
the situation, the more valid are generalizations to other situations likely to be.
Another virtue of field experiments is their appropriateness for studying complex social and psychological influences, processes, and changes in lifelike situations. Lepper, Greene, and Nisbett, for example, studied the effects of extrinsic rewards on children's motivation in a nursery school setting.¹⁰ Deci, who had earlier studied the same problem, used both laboratory experiments and a field experiment.¹¹ Coch and French, many years ago, manipulated participation of workers in planning, and studied its effect on production, resignations, and aggression.¹²
Field experiments are well-suited both to testing theory and to obtaining answers to
practical questions. Whereas laboratory experiments are suited mainly to testing aspects
of theories, field experiments are suited both to testing hypotheses derived from theories
and to finding answers to practical problems. Methods experiments in education, usually
practical in purpose, often seek to determine which method among two or more methods
is the best for a certain purpose. Industrial research and consumer research depend heavily
on field experiments. Much social psychological research, on the other hand, is basically
'"M. Lepper, D. Greene, and R. Nisbett. "Undermining Children's Intrinsic Interest With Extrinsic Re-
ward:A Test of the 'Overjustification' Hypothesis," Journal of Personalin' and Social Psxchologw 23 (1973),
129-137.
"E. Deci. "Effects of Externally Mediated Rewards on Intrinsic Motivation," Journal of Personality and
Social Psychology. 18(1971), 105-115.
'"L. Coch and J. French. "Overcoming Resistance to Change," Human Relations, 1 (1948), 512-532. (A
pioneering and influential study.)
theoretical. Dutton and Lake, in the study of inverse discrimination described in Chapter 14, were heavily influenced by theories of prejudice.¹³ The Lepper et al. and the Deci studies, too, were guided by theory.
¹³D. Dutton and R. Lake, "Threat of Own Prejudice and Reverse Discrimination," Journal of Personality and Social Psychology, 28 (1973), 94-100.
¹⁴Good advice on handling this aspect of field situations is given by J. French, "Experiments in Field Settings," in Festinger and Katz, op. cit., pp. 118-129, and D. Katz, "Field Studies," ibid., pp. 87-89.
It is easy to assume that the administrators or the teachers will not permit random assignment. This assumption is not necessarily correct, however.
The consent and cooperation of teachers and administrators can often be obtained if a proper approach, with adequate and accurate orientation, is used, and if explanations of the reasons for the use of specific experimental methods are given. The points being emphasized are these: Design research to obtain valid answers to the research questions. Then, if it is necessary to make the experiment possible, and only then, modify the "ideal" design. With imagination, patience, and courtesy, many of the practical problems of implementation of research design can be satisfactorily solved.
One other weakness inherent in field experimental situations is lack of precision. In
the laboratory experiment it is possible to achieve a high degree of precision or accuracy,
so that laboratory measurement and control problems are usually simpler than those in
field experiments. In realistic situations, there is always a great deal of systematic and
random noise. In order to measure the effect of an independent variable on a dependent
variable in a field experiment, it is not only necessary to maximize the variance of the
manipulated variable and any assigned variables, but also to measure the dependent varia-
ble as precisely as possible. But in realistic situations, such as in schools and community
groups, extraneous independent variables abound. And measures of dependent variables,
unfortunately, are sometimes not sensitive enough to pick up the messages of our inde-
pendent variables. In other words, the dependent variable measures are often so inade-
quate they cannot pick up all the variance that has been engendered by the independent
variables.
FIELD STUDIES
Field studies are nonexperimental scientific inquiries aimed at discovering the relations
and interactions among sociological, psychological, and educational variables in real
social structures. In this book, any scientific studies, large or small, that systematically
pursue relations and test hypotheses, that are nonexperimental, and that are done in life
situations like communities, schools, factories, organizations, and institutions will be
considered field studies.
The investigator in a field study first looks at a social or institutional situation and then
studies the relations among the attitudes, values, perceptions, and behaviors of individuals
and groups in the situation. He ordinarily manipulates no independent variables. Before
we discuss and appraise the various types of field studies, it will be helpful to consider
examples. We have already examined field studies in Chapter 22 and in this chapter: the
Authoritarian Personality study, the Newcomb Bennington study, and others. We now
briefly examine two smaller field studies.
Jones and Cook tested the socially and politically important hypothesis that preferences for social policies to advance racial equality are influenced by racial attitudes. More specifically, individuals with positive attitudes toward blacks will favor societal change policies and individuals with negative attitudes toward blacks will favor self-improvement policies.¹⁵ They measured attitudes toward blacks with a well-constructed and validated
"S. Jones and S. Cook, "The Intluence of Attitude on Judgments of the Effectiveness of Alternative Social
Policy," Journal of Personality ami Social Psychology, 32 (1975), Itl-lli. Some of the data of this study was
used in Chapter 13.
attitude scale (including items bearing, for example, on civil rights and prointegration activities). The dependent variable, preference for social policy, was measured with a set of 30 policy items, each of which had two alternatives, one favoring social change and the other self-improvement. The hypothesis was supported: attitudes toward blacks evidently influence judgments of effective social policies.
The second smaller field study, part of a larger study by McClelland on the relation between Protestantism and capitalistic growth, was summarized in Chapter 21. Recall that several variables (Protestant-Catholic, need for achievement, and electricity consumption, an index of capitalistic growth or development, among others) were measured in 25 countries in 1925 and 1950. It was found, as predicted, that Protestantism was positively related to capitalistic growth.
Note that the problems of both field studies were attacked nonexperimentally: neither randomization nor experimental manipulation was possible. Note, too, an important difference in the data-gathering methods of the two studies. In the Jones and Cook study, data were collected directly from students at two universities. In the McClelland study, however, data on the variables in the 25 countries were "indirectly" collected from published sources, mainly world population statistics. While some might argue that the Jones and Cook study data are stronger than the McClelland data because the former were collected "directly," there is really no difference in principle: both studies are nonexperimental field studies and both are creative and important contributions.
Katz has divided field studies into two broad types: exploratory and hypothesis-testing.¹⁶ The exploratory type, says Katz, seeks what is rather than predicts relations to be found.
The massive Equality of Educational Opportunity, cited in Chapter 22, exemplifies this
type of field study. Exploratory studies have three purposes: to discover significant varia-
bles in the field situation, to discover relations among variables, and to lay the ground-
work for later, more systematic and rigorous testing of hypotheses.
Throughout this book to this point, the use and testing of hypotheses have been
emphasized. It is well to recognize, though, that there are activities preliminary to hypoth-
esis-testing in scientific research. In order to achieve the desirable aim of hypothesis-test-
ing, preliminary methodological and measurement investigation must often be done.
Some of the finest work of the twentieth century has been in this area. An example is that
done by the factor analyst, who is preoccupied with the discovery, isolation, specifica-
tion, and measurement of underlying dimensions of achievement, intelligence, aptitudes,
attitudes, situations, and personality traits.
The second subtype of exploratory field studies, research aimed at discovering or uncovering relations, is indispensable to scientific advance in the social sciences. It is necessary to know, for instance, the correlates of variables. Indeed, the scientific meaning of a construct springs from the relations it has with other constructs. Assume that we have no scientific knowledge of the construct "intelligence": we know nothing of its causes or concomitants. For example, suppose that we know nothing whatever about the relation of intelligence to achievement. It is conceivable that we might do a field study in school situations. We might carefully observe a number of boys and girls who are said to be "bright" and others said to be "dull," and note that the "bright" children seem to learn more quickly, have a wider vocabulary, and so on. We now have some clues to the nature of intelligence, so that we can attempt to construct a simple measure of intelligence. Note that our "definition" of intelligence springs from what presumably intelligent and nonintelligent children do. A similar procedure can be followed with the variable "achievement."
Field studies are strong in realism, significance, strength of variables, theory orientation, and heuristic quality. The variance of many variables in actual field settings is large, especially when compared to the variance of the variables of laboratory experiments. Consider the contrast between the impact of social norms in a laboratory experiment like Sherif's and the impact of these norms in a community where, say, certain actions of teachers are frowned upon and others approved. Consider also the difference between studying cohesiveness in the laboratory, where subjects are asked, for example, whether they would like to remain in a group (a measure of cohesiveness), and studying the cohesiveness of a school faculty, where staying in the group is an essential part of one's professional life. The difficulty is that even though the effects may be strong and the variance great, it is not easy for the researcher to separate the variables.
The realism of field studies is obvious. Of all types of studies, they are closest to real
life. There can be no complaint of artificiality here. (The remarks about realism in field
experiments apply, a fortiori, to the realism of field studies.)
Field studies are highly heuristic. One of the research difficulties of a field study is to
keep the study contained within the limits of the problem. Hypotheses frequently fling themselves at one. The field is rich in discovery potential. For example, one may wish to test the hypothesis that the social attitudes of board of education members are a determinant
of board of education policy decisions. After starting to gather data, however, many
interesting notions that can deflect the course of the investigation can arise: the relation
between the attitudes of board of education members and their election to the boards, the
relation between the scope of men's business and professional interests and their seeking
board membership, and the different conceptions of curriculum problems of board mem-
bers, administrators, teachers, and parents.
Despite these strengths, the field study is a scientifically weak cousin of laboratory and
field experiments. Its most serious weakness, of course, is its nonexperimental character.
Thus statements of relations are weaker than they are in experimental research. To compli-
cate matters, the field situation almost always has a plethora of variables and variance.
Think of the many possible independent variables that we can choose as determinants of
delinquency or of school achievement. In an experimental study, these variables can be
controlled to a large extent, but in a field study they must somehow be controlled by more
indirect and less satisfactory means.
Another methodological weakness is the lack of precision in the measurement of field
variables. In field studies the problem of precision is more acute, naturally, than in field
experiments. The difficulty encountered by Astin (and others) in measuring college envi-
ronment* is one of many similar examples. Administrative environment, for example, was measured by students' perceptions of aspects of the environment. Much of the lack of precision is due to the greater complexity of field situations.**
*A. Astin, The College Environment. Washington, D.C.: American Council on Education, 1968.
Other weaknesses of field studies are practical problems: feasibility, cost, sampling,
and time. These difficulties are really potential weaknesses — none of them need be a real
weakness. The most obvious questions that can be asked are: Can the study be done with
the facilities at the investigator's disposal? Can the variables be measured? Will it cost too
much? Will it take too much time and effort? Will the subjects be cooperative? Is random
sampling possible? Anyone contemplating a field study has to ask and answer such ques-
tions. In designing research it is important not to underestimate the large amounts of time,
energy, and skill necessary for the successful completion of most field studies. The field
researcher needs to be a salesman, administrator, and entrepreneur, as well as investigator.***
Study Suggestions
Menlove, 21; Suedfeld and Rank, 21; Kounin, 22; Verba and Nie, 22; Bollen, 22.
7. There is considerable controversy over the purpose of research. Two broad views, with variations between, oppose each other. One of these says that the purpose of scientific research is theory or explanation, as pointed out in Chapter 1. The other view, which seems particularly prevalent in education, is that the purpose of research is to help improve human and social conditions, to help find solutions to human and technical problems. In general, the scientist favors the first view. Do you agree with the statement? If you do, give reasons for your agreement: Why is the statement correct if, indeed, it is correct? If you do not agree, say why you don't. Before making snap judgments, read and ponder the references given in Study Suggestion 9, below.
**Studies of organizations, for example, are mostly field studies, and the measurement of organizational variables well illustrates the difficulties. "Organizational Effectiveness" appears to be as complex as "Teacher Effectiveness." For a thorough and enlightening discussion, see: D. Katz and R. Kahn, The Social Psychology of Organizations, 2d ed. New York: Wiley, 1978, chap. 8, especially pp. 224-226. This superb book well repays careful reading and study.
***For details, see Katz, op. cit., especially pp. 65ff.
9. Unfortunately, there has been much uninformed criticism of experiments. Before pronounc-
ing rational judgments on any complex phenomenon we should first know what we're talking about,
and, second, we should know the nature and purpose of the phenomenon we criticize. To help you
reach rational conclusions about the experiment and experimentation, the following references are
offered as background reading.
Berkowitz, L., and Donnerstein, E. "External Validity Is More than Skin Deep: Some Answers to Criticisms of Experiments," American Psychologist, 37 (1982), 245-257. A penetrating answer to the criticism of experiments as lacking external validity.
Kaplan, A. The Conduct of Inquiry. San Francisco: Chandler, 1964, chap. IV. This chapter
called "Experiment" seems to include most controlled observation.
Aronson, E., and Carlsmith, J. "Experimentation in Social Psychology." In G. Lindzey and
E. Aronson, eds., The Handbook of Social Psychology, 2d ed., vol. 2. Reading, Mass.:
Addison-Wesley, 1968, chap. 9.
Fisher, R. The Design of Experiments, 6th ed. New York: Hafner, 1951. A justly famous book.
Survey Research
Survey research studies large and small populations (or universes) by selecting and
studying samples chosen from the populations to discover the relative incidence, distribu-
tion, and interrelations of sociological and psychological variables.' Surveys covered by
this definition are often called sample surveys, probably because survey research devel-
oped as a separate research activity, along with the development and improvement of
sampling procedures. Surveys, as such, are not new. Social welfare studies were done in
England as long ago as the eighteenth century. Survey research in the social scientific
sense, however, is quite new — it is a development of the twentieth century.
¹This chapter concentrates on the use of survey research in scientific research and neglects so-called status surveys, the aim of which is to learn the status quo rather than to study the relations among variables. There is no intention of derogating status surveys; they are useful, even indispensable. The intention is to emphasize the importance and usefulness of survey research in the scientific study of socially and educationally significant problems. The work of public opinion pollsters, such as Gallup and Roper, is also neglected. For a good account of polls and other surveys, see M. Parten, Surveys, Polls, and Samples. New York: Harper & Row, 1950, chap. 1. Though old, this book is still valuable. The current standard text is: D. Warwick and C. Lininger, The Sample Survey: Theory and Practice. New York: McGraw-Hill, 1975. This book has the advantage of having been guided by the thinking and practice of the Survey Research Center, University of Michigan. It also has the advantage of having a cross-cultural emphasis.
²A. Campbell and G. Katona, "The Sample Survey: A Technique for Social-Science Research." In L. Festinger and D. Katz, Research Methods in the Behavioral Sciences. New York: Holt, Rinehart and Winston, 1953, chap. 1.
Table 24.1 Relation Between Race and Trust in People, Campbell et al. [entries are percentages; column headings reconstructed]

            Low Trust    High Trust
Black           72           28
White           38           62

N = 2070.
preference, and the like. They want to know the relation between attitudes toward educa-
tion and public support of school budgets.
Only rarely, however, do survey researchers study whole populations: they study
samples drawn from populations. From these samples they infer the characteristics of the
defined population or universe. The study of samples from which inferences about popu-
lations can be drawn is needed because of the difficulties of studying whole populations.
Random samples can often furnish the same information as a census (an enumeration and
study of an entire population) at much less cost, with greater efficiency, and sometimes
greater accuracy!
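To make the sampling argument concrete, here is a minimal sketch in Python, using a hypothetical population and made-up figures, of how a random sample estimates a population proportion with a computable margin of error:

    import math
    import random

    random.seed(1)

    # Hypothetical population of 100,000 citizens, 60% holding some attribute.
    population = [1] * 60000 + [0] * 40000

    sample = random.sample(population, 1000)           # simple random sample
    p_hat = sum(sample) / len(sample)                  # sample proportion
    se = math.sqrt(p_hat * (1 - p_hat) / len(sample))  # standard error

    print("estimate = %.3f, 95%% CI = %.3f to %.3f"
          % (p_hat, p_hat - 1.96 * se, p_hat + 1.96 * se))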
Sample surveys attempt to determine the incidence, distribution, and interrelations
among sociological and psychological variables, and, in so doing, usually focus on peo-
ple, the vital facts of people, and their beliefs, opinions, attitudes, motivations, and
behavior. The social scientific nature of survey research is revealed by the nature of its variables. Table 24.1, for example, shows that blacks are less trusting of people than whites. As Campbell et al. say (p. 455), "those people who have been least
successful in their encounters with society have the least reason to feel trustful of it."
Survey researchers, of course, also study the relations among psychological variables
(see, for example, Table 10.3, Chapter 10). But most relations of survey research are those between sociological and psychological variables: between education and tolerance (see Table 10.9), between race and self-esteem (Table 10.11), and between education and
sense of political efficacy (Study Suggestion 9, Chapter 10), for example.
TYPES OF SURVEYS
Surveys can be conveniently classified by the following methods of obtaining informa-
tion: personal interview, mail questionnaire, panel, and telephone. Of these, the personal
³For a complete description of such personal and social facts, see Parten, op. cit., pp. 169-174.
⁴A. Campbell, P. Converse, and W. Rodgers, The Quality of American Life. New York: Russell Sage Foundation, 1976. I calculated the percentages of Table 24.1 from the reported percentages and frequencies of Campbell et al.'s Table 13-11, p. 455, to show the relation clearly.
interview far overshadows the others as perhaps the most powerful and useful tool of
social scientific survey research. These survey types will be briefly described here; in later
chapters, when studying methods of data collection, we will study the personal interview
in depth.
The best survey research uses the personal interview as the principal method of gathering information. This is accomplished in part by the careful and laborious construction of an interview schedule. Factual questions are usually asked early in the interview. Much of this material is neutral in character and helps the interviewer establish rapport with the
respondent. Questions of a more personal nature, such as those about income and personal
habits, and questions that are more difficult to answer such as the extent of the knowledge
or ability of the respondent, can be reserved for later questioning, perhaps at the end of the
schedule. The timing must necessarily be a matter of judgment and experience.*
Other kinds of factual information include what respondents know about the subject
under investigation, what respondents did in the past, what they are doing now, and what
they intend to do in the future. After all, unless we observe behavior directly, all data
about respondents' behavior must come from them or from other people. In this special
sense, past, present, and future behavior can all be classified under the "fact" of behav-
ior, even if the behavior is only an intention. A major point of such factual questions is
that the respondent presumably knows a good deal about his own actions and behavior. If he
says he voted for a school bond issue, we can believe him —
unless there is compelling evi-
dence to the contrary. Similarly, we can believe him, perhaps with more reservation (since
the event has not happened yet), if he says he is going to vote for a school bond issue.
Just as important, maybe even more important from a social scientific standpoint, are
the beliefs, opinions, attitudes, and feelings that respondents have about cognitive ob-
jects.^ Many of the cognitive objects of survey research may not be of interest to the
researcher: investments, certain commercial products, political candidates, and the like.
Other cognitive objects are more interesting: the United Nations, the Supreme Court,
educational practices, integration. Federal aid to education, college students, Jews, and
the women's liberation movement.
The personal interview can be helpful in learning respondents' reasons for doing or believing something. When asked reasons for actions, intentions, or beliefs, people may
say they have done something, intend to do something, or feel certain ways about some-
thing. They may say that group affiliations or loyalties or certain events have influenced
them. Or they may have heard about issues under investigation via public media of
communication. For example, a respondent may say that, while he was formerly opposed
to Federal aid to education because he and his political party have always opposed govern-
ment interference, he now supports Federal aid because he has read a great deal about the
problem in newspapers and magazines and has come to the conclusion that Federal aid
will benefit American education.
A respondent's desires, values, and needs may influence his attitudes and actions.
When saying why he favors Federal aid to education the respondent may indicate that his
own educational aspirations were thwarted and that he has always yearned for more
education. Or he may indicate that his religious group has, as a part of its value structure,
a deep commitment to the education of children. If the individual under study has accu-
rately sounded his own desires, values, and needs — and can express them verbally — the
personal interview can be very valuable.
The next important type of survey research is the panel.^ A sample of respondents is
selected and interviewed, and then reinterviewed and studied at later times. The panel
technique enables the researcher to study changes in behaviors and attitudes.
Telephone surveys have little to recommend them beyond speed and low cost. Espe-
cially when the interviewer is unknown to the respondent they are limited by possible
nonresponse, uncooperativeness, and by reluctance to answer more than simple, superfi-
cial questions. Yet telephoning can sometimes be useful in obtaining information essential
to a study. Its principal defect, obviously, is the inability to obtain detailed information.
The mail questionnaire, another type of survey, has serious drawbacks unless it is
used in conjunction with other techniques. Two of these defects are possible lack of
response and the inability to check the responses given. These defects, especially the first,
are serious enough to make the mail questionnaire worse than useless, except in highly
sophisticated hands. Responses to mail questionnaires are generally poor. Returns of less
than 40 or 50 percent are common. Higher percentages are rare. At best, the researcher
must content himself with returns as low as 50 or 60 percent.
As a result of low returns in mail questionnaires, valid generalizations cannot be
made.' Although there are means of securing larger returns and reducing deficiencies — follow-up questionnaires, enclosing money, interviewing a random sample of nonre-
spondents and analyzing nonrespondent data —
these methods are costly, time-consum-
ing, and often ineffective. As Parten says, "Most mail questionnaires bring so few
returns, and these from such a highly selected population, that the findings of such sur-
veys are almost invariably open to question."'" The best advice would seem to be not to
use mail questionnaires if a better method can possibly be used. If they are used, every
effort should be made to obtain returns of at least 80 to 90 percent or more, and lacking
such returns, to learn something of the characteristics of the nonrespondents.
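The arithmetic of nonresponse bias can be sketched in a few lines of Python; the figures are hypothetical, chosen only to show how a 40 percent return can mislead when nonrespondents differ from respondents:

    return_rate = 0.40
    p_respondents = 0.70     # 70% of those who returned the form agree
    p_nonrespondents = 0.40  # assumed; unknowable without follow-up study

    true_p = (return_rate * p_respondents
              + (1 - return_rate) * p_nonrespondents)
    print("observed: %.0f%%, true population value: %.0f%%"
          % (100 * p_respondents, 100 * true_p))
    # observed: 70%, true population value: 52%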
design and the implementation of the design of studies, the unambiguous definition and
specification of the research problem, and the analysis and interpretation of data.
In the limited space of a section of one chapter, it is obviously impossible to discuss adequately the methodology of survey research. Only those parts of the methodology germane to the purposes of this book, therefore, will be outlined: the survey or study design, the so-called flow plan or chart of survey researchers, and the check of the reliability and validity of the sample and the data-gathering methods. (Sampling was discussed in Part Three, analysis in Part Four.)
Survey researchers use a "flow plan" or chart to outline the design and subsequent implementation of a survey.¹¹ The flow plan starts with the objectives of the survey, lists each step to be taken, and ends with the final report. First, the general and specific
problems that are to be solved are as carefully and as completely stated as possible. Since,
in principle, there is nothing very different here from the discussion for problems and
hypotheses of Chapter 2, we can omit detailed discussion and give one simple hypotheti-
cal example. An educational investigator has been commissioned by a board of education
to study the attitudes of community members toward the school system. On discussing the
general problem with the board of education and the administrators of the school system,
the investigator notes a number of more specific problems such as: Is the attitude of the
members of the community affected by their having children in school? Are their attitudes
affected by their educational level?
One of the investigator's most important jobs is to specify and clarify the problem. To do this well, he should not expect just to ask people what they think of the schools, although this may be a good way to begin if one does not know much about the subject.
He should also have specific questions to ask that are aimed at various facets of the
problem. Each of these questions should be built into the interview schedule. Some survey
researchers even design tables for the analysis of the data at this point in order to clarify
the research problem and to guide the construction of interview questions. Since this
procedure is recommended, let us design a table to show how it can be used to specify
survey objectives and questions.
Take the question: Is attitude related to educational level? The question requires that
"attitude" and "educational level" be operationally defined. Positive and negative atti-
tudes will be inferred from responses to schedule questions and items: If, in response to a
broad question like, "In general, what do you think of the school system here?" a
respondent says, "It is one of the best in this area," it can be inferred that he has a
positive attitude toward the schools. Naturally, one question will not be enough. Related
questions should be used, too. A definition of "educational level" is quite easy to obtain.
It is decided to use three levels: (1) Some College, (2) High School Graduate, and (3)
Non-High School Graduate. The analysis paradigm might look like Figure 24.1.
Figure 24.1 [analysis paradigm: attitude (Positive and, presumably, Negative) by educational level (Some College, High School Graduate, Non-High School Graduate)]
The virtue of paradigms like this is that the researcher can immediately tell whether he has stated a specific problem clearly and whether the specific problem is related to the general problem. It also gives him some notion as to how many respondents he will need to fill the table cells adequately, as well as providing him guidelines for coding and analysis. In addition, as Katz says, "By actually going through the mechanics of setting out such tables, the investigators are bound to discover complexities of a variable that need more detailed measurement and qualifications of hypotheses in relation to special conditions."¹²
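Such an analysis paradigm can today be set out directly as a cross-tabulation. A minimal sketch in Python; the interview records are hypothetical, invented only to show the form of the Figure 24.1 table:

    import pandas as pd

    # Hypothetical coded responses from six interviews.
    responses = pd.DataFrame({
        "education": ["Some College", "High School Graduate",
                      "Non-High School Graduate", "Some College",
                      "High School Graduate", "Non-High School Graduate"],
        "attitude": ["Positive", "Positive", "Negative",
                     "Negative", "Positive", "Negative"],
    })

    # The analysis paradigm of Figure 24.1 as a contingency table.
    print(pd.crosstab(responses["education"], responses["attitude"]))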
The next step in the flow plan is the sample and the sampling plan. Because sampling
is much too complex to be discussed here,¹³ we outline only the main ideas. First, the universe to be sampled and studied must be defined. Are all citizens living in the community included? Community leaders? Those citizens paying school taxes? Those with chil-
dren of school age? Once the universe is defined, a decision is made as to how the sample
is to be drawn and how many cases will be drawn. In the best survey research, random
samples are used. Because of their high cost and greater difficulty of execution random
samples are often bypassed for quota samples. In a quota (or quota control) sample,
"representativeness" is presumably achieved by assigning quotas to interviewers so —
many men and women, so many whites and blacks, and so on. Quota sampling should be
avoided: while it may achieve representativeness, it lacks the virtues of random sampling.
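To make the contrast concrete, here is a minimal sketch, with a hypothetical sampling frame, of drawing a simple random sample, in which every member of the defined universe has an equal chance of selection:

    import random

    random.seed(42)
    universe = ["resident_%d" % i for i in range(5000)]  # the defined universe
    sample = random.sample(universe, k=200)              # simple random sample
    print(sample[:5])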
The next large step in a survey is the construction of the interview schedule and other
measuring instruments to be used. This is a laborious and difficult business bearing virtu-
ally no resemblance to the questionnaires often hastily put together by neophytes. The
main task is to translate the research question into an interview instrument and into any
other instruments constructed for the survey. One of the problems of the study, for in-
stance, may be: How are permissive and restrictive attitudes toward the discipline of
children related to perceptions of the local school system? Among the questions to be
written to assess permissive and restrictive attitudes, one might be: How do you feel
children should be disciplined? After drafts of the interview schedule and other instru-
ments are completed, they are pretested on a small representative sample of the universe.
They are then revised and put in final form.
The steps outlined above constitute the first large part of any survey. Data collection is
the second large part. Interviewers are oriented, trained, and sent out with complete
instructions as to whom to interview and how the interview is to be handled. In the best
surveys, interviewers are allowed no latitude as to whom to interview. They must inter-
view those individuals and only those individuals designated, generally by random de-
vices. Some latitude may be allowed in the actual interviewing and use of the schedule,
but not much. The work of interviewers is also systematically checked in some manner.
For example, every tenth interview may be checked by sending another interviewer to the
same respondent. Interview schedules are also studied for signs of spurious answering and
reporting.
The third large part of the flow plan is analytical. The responses to questions are coded
and tabulated. Coding is the term used to describe the translation of question responses
¹²D. Katz, "Field Studies." In Festinger and Katz, op. cit., pp. 80-81.
¹³See Chapter 8, above. Warwick and Lininger's discussions of sampling are helpful: op. cit., chaps. 4, 5. Their Chapter 5, a detailed example of multistage area sampling, is especially helpful. Area sampling is the type most used in survey research. First, defined large areas are sampled at random. This amounts to partitioning of the universe and random sampling of the cells of the partition. The partition cells may be areas delineated by grids on maps or aerial photographs of counties, school districts, or city blocks. Then further subarea samples may be drawn at random from the large areas already drawn. Finally, all individuals or families or random samples of individuals and families may be drawn.
and respondent information to specific categories for purposes of analysis.¹⁴ Take the example of Figure 24.1. All respondents must be assigned to one of the three educational-level categories and a number (or other symbol) assigned to each level. Then each person
must also be assigned to a "positive attitude" or "negative attitude" category. To aid in
the coding, content analysis may be used. Content analysis is an objective and quantitative
method for assigning types of verbal and other data to categories (see Chapter 30, below).
Coding can mean the analysis of factual response data and then the assignment of individ-
uals to classes or categories, or the assigning of categories to individuals, especially if one
is preparing cards for computer analysis. Such cards consist of a large number of columns
with a number of cells in each column. The fifth column may be assigned, say, to sex, and the first two cells of the column, or the numbers 0 and 1, used to designate female and male.
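A minimal coding sketch in Python; the category codes for education are hypothetical, though the sex coding follows the 0/1 convention just described:

    sex_codes = {"female": 0, "male": 1}
    education_codes = {"Some College": 1,
                       "High School Graduate": 2,
                       "Non-High School Graduate": 3}

    respondent = {"sex": "female", "education": "Some College"}
    coded = {"sex": sex_codes[respondent["sex"]],
             "education": education_codes[respondent["education"]]}
    print(coded)  # {'sex': 0, 'education': 1}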
Tabulation is the recording of the numbers of types of responses in the appropriate
categories, after which statistical analysis follows: percentages, averages, relational indi-
ces, and appropriate tests of significance. The analyses of the data are studied, collated,
assimilated, and interpreted. Finally, the results of this interpretative process are reported.
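A sketch of the tabulation-and-analysis step. The cell frequencies below are illustrative, constructed only to follow the percentages of Table 24.1; the chi-square test is one of the "appropriate tests of significance" just mentioned:

    from scipy.stats import chi2_contingency

    #            low trust  high trust
    observed = [[720, 280],   # Black (72% / 28%)
                [380, 620]]   # White (38% / 62%)

    chi2, p, dof, expected = chi2_contingency(observed)
    print("chi-square = %.1f, df = %d, p = %.4g" % (chi2, dof, p))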
Survey research has a unique advantage among social scientific methods: it is often possi-
ble to check the validity of survey data. Some of the respondents can be interviewed
again, and the results of both interviews checked against each other. It has been found that
the reliability of personal factual items, like age and income, is high.¹⁵ The reliability of
attitude responses is harder to determine because a changed response can mean a changed
attitude. The reliability of average responses is higher than the reliability of individual
responses. Fortunately, the researcher is usually more interested in averages, or group
measures, than in individual responses.
One way of checking the validity of a measuring instrument is to use an outside criterion. One compares one's results to some outside, presumably valid, criterion. For
instance, a respondent tells us he voted in the last election of school board members.
We can find out whether he did or not by checking the registration and voting rec-
ords. Ordinarily, individual behavior is not checked, because information about individ-
uals is hard to obtain, but group information is often available. This information can be
used to test to some extent the validity of the survey sample and the responses of the re-
spondents.
A good example of an outside check on survey data is the use of information from the
last census. This is particularly useful in large-scale surveys, but it may also help in
smaller ones. Proportions of men and women, races, educational levels, age, and so on in
the sample and in the U.S. census are compared. In the Verba and Nie study of political participation, for example, the authors report a number of such comparisons.¹⁶ Table 24.2 reports some of them. It is obvious that the sample estimates are accurate: only one of them, Age 20-34, deviates from the Census estimates by more than 2 percent.
'"Simple coding is discussed in Warwick and Lininger, op. cil.. chap. 9. For detailed discussions of coding
and coding problems, instructional materials are available from the Institute for Social Research, University of
Michigan, Ann Arbor, Michigan 48104. The Institute also issues bibliographies on survey research and related
matters.
'^Parten, op. cit.. pp. 496-498.
"*S. Verba and N. Nie, Participation in America: Political Democracy and Social Equality. New York:
Table 24.2 Comparison of Sample and Census Data, Verba and Nie [table body not recovered; columns: Characteristic, Survey, Census]
higher status citizens whose participation is influential. The authors point out that al-
though Americans are not noted for class-based ideology, social status does relate to
participation.²⁰ The study was especially characterized by sophisticated measurement and
analytic methodology, and by a major disconcerting finding. We will return to it and its
methodology in later chapters.
This study explicitly stated and tested alternative theories of racial prejudice. Indeed, the
authors deplored the lack of empirical confrontation between alternative theories in the
prejudice literature (p. 414). The study is also unusual because its circumstances "permitted . . . the rare luxury of a complete replication" (albeit with small samples).
The theories of prejudice tested were called symbolic racism and racial threat. Symbolic racism, say the authors, is a blend of antiblack sentiment and traditional moral values of the Protestant Ethic. It is resistance to change in the racial status quo based on
moral feelings that blacks violate traditional American values of self-reliance, individual-
ism, the work ethic, and discipline. This syndrome of determinants of racial prejudice is
the descendant of an older social-cultural learning theory, which emphasized that preju-
dice was learned by children along with other normative values and attitudes. But, say
Kinder and Sears, white America has become, at least in principle, racially egalitar-
ian, and another explanation is therefore necessary.
The alternative explanation, group conflict theory, emphasizes the threats that blacks pose to the private lives of whites: moving into white neighborhoods, moving into better
jobs displacing whites, insistence on integrated schools and enforced busing and racial
mixing of young children, and the threat of rising violence perceived as due to black
criminality. The implications of both theories were used to construct measures of personal
racial threats and symbolic racism, which were related to the dependent variable, intended
votes for two candidates for mayor of Los Angeles in 1969 and 1973, Yorty, a white
conservative, and Bradley, a black liberal.
White residents of two suburban communities in Los Angeles were selected in what
appears to have been quota sampling. Two samples of adults were interviewed in 1969
and in 1973. The questions asked focused mainly on issues relating to personal threat and
symbolic racism. The researchers' analysis indicated that symbolic racism was a more
important form of racial prejudice than racial threat. That is, symbolic racism accounted
for more of the variance of candidate preference than racial threat. Various control analy-
ses further supported this finding. The authors concluded that "racial attitudes were major
determinants of voting in both mayoral elections" (ibid., p. 427), and racial threats to
whites' lives were largely irrelevant. These findings and conclusions are highly important
both theoretically and practically.
These two studies clearly show a trend toward using survey research as a tool to test theory and hypotheses, in contrast to older use in which the emphasis was on finding
"what is there." We can expect this trend to continue and to grow, especially in sociol-
ogy, political science, and education.
²⁰Ibid., p. 339.
²¹D. Kinder and D. Sears, "Prejudice and Politics: Symbolic Racism versus Racial Threats to the Good Life," Journal of Personality and Social Psychology, 40 (1981), 414-431.
Survey research's strong emphases on representative samples, overall design and
plan of research, and expert interviewing using carefully and competently constructed
interview schedules have had and will continue to have beneficial influence on behavioral
research. Despite its evident potential value in all behavioral research fields, survey re-
search has not been used to any great extent where it would seem to have large theoretical
and practical value: in education.²² Its distinctive usefulness in education and educational
research seems not to have been realized. This section is therefore devoted to application
of survey research to education and educational problems.
Obviously, survey research is a useful tool for educational fact-finding. An adminis-
trator, a board of education, or a staff of teachers can learn a great deal about a school
system or a community without contacting every child, every teacher, and every citizen.
In short, the sampling methods developed in survey research can be very useful. It is
unsatisfactory to depend upon relatively hit-or-miss, so-called representative samples
based on "expert" judgments. Nor is it necessary to gather data on whole populations;
samples are sufficient for many purposes.
Most research in education is done with relatively small nonrandom samples. If hy-
potheses are supported, they can later be tested with random samples of populations, and
if again supported, the results can be generalized to populations of schools, children, and
laymen. In other words, survey research can be used to test hypotheses already tested in
more limited situations, with the result that external validity is increased.
Survey research seems ideally suited to some of the large controversial issues of
education. For example, its ability to handle "difficult" problems like integration and
school closings through careful and circumspect interviewing puts it high on the list of
research approaches to such problems. Researchers and educators can study the impact of
integration and of school closings on communities and their school systems. Interviews of
random samples of citizens and teachers of school districts just starting integration or the
closing of certain elementary schools because of declining enrollment can provide valua-
ble information on the concerns and fears of citizens so that appropriate measures to
inform them and lessen their fears can be taken. The effect of these measures can of
course also be studied.
Survey research is probably best adapted to obtaining personal and social facts, be-
liefs, and attitudes. It is significant that, although hundreds of thousands of words
are spoken and written about education and about what people presumably think about
education, there is little dependable information on the subject. We simply do not
know what people's attitudes toward education are. We have to depend on feature
writers and so-called experts for this information. Boards of education frequently de-
pend on administrators and local leaders to tell them what the people think. Will they
support an expanded budget next year? What will they think about a merger of adjoin-
ing school districts? How will they react to busing white and black children to achieve
desegregation?
²²An outstanding example of survey research in education, however, a study that should be read by educational administrators and board of education members, is: N. Gross, W. Mason, and A. McEachern, Explorations in Role Analysis: Studies of the School Superintendency Role. New York: Wiley, 1958. A shorter report is:
A first weakness is that survey information tends to skim the surface: scope is usually emphasized at the expense of depth. This seems to be a weakness, however, that is not necessarily inherent in the method. The Verba and Nie and other studies show that it
is possible to go considerably below surface opinions. Yet the survey seems best adapted
to extensive rather than intensive research. Other types of research are perhaps better
adapted to deeper exploration of relations.
A second weakness is a practical one. Survey research is demanding of time and
money. In a large survey, it may be months before a single hypothesis can be tested.
Sampling and the development of good schedules are major operations. Interviews require
skill, time, and money. Surveys on a smaller scale can avoid these problems to some
extent, even though it is generally true that survey research demands large investments of
time, energy, and money.
Any research that uses sampling is naturally subject to sampling error. While it is true
that survey information has been found to be relatively accurate, there is always the one
chance in twenty or a hundred that an error more serious than might be caused by minor
fluctuations of chance may occur. The probability of such an error can be diminished by
building safety checks into a study — by comparing census data or other outside informa-
tion and by sampling the same population independently.
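One such safety check, sampling the same population independently and comparing the estimates, can be sketched as follows (the population and sample sizes are hypothetical):

    import random

    random.seed(7)
    population = [1] * 55000 + [0] * 45000  # hypothetical: 55% favorable

    est1 = sum(random.sample(population, 800)) / 800
    est2 = sum(random.sample(population, 800)) / 800
    print("sample 1: %.3f  sample 2: %.3f  difference: %.3f"
          % (est1, est2, abs(est1 - est2)))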
A potential rather than an actual weakness of this method is that the survey interview
can temporarily lift the respondent out of his own social context, which may make the
results of the survey invalid. The interview is a special event in the ordinary life of the
respondent. This apartness may affect the respondent so that he talks to, and interacts
with, the interviewer in an unnatural manner. For example, a mother, when queried about her child-rearing practices, may give answers that reveal methods she would like to use rather than those she does use. It is possible for interviewers to limit the effects of lifting
respondents out of social context by skilled handling, especially by one's manner and by
careful phrasing and asking of questions.²³
Survey research also requires a good deal of research knowledge and sophistication.
The competent survey investigator must know sampling, question and schedule construc-
tion, interviewing, the analysis of data, and other technical aspects of the survey. Such
knowledge is hard to come by. Few investigators get this kind and amount of experience.
As the value of survey research, both large- and small-scale, becomes appreciated, it can
be anticipated that such knowledge and experience will be considered, at least in a mini-
mal way, to be necessary for researchers.
"C. Cannell and R. Kahn, "Interviewing." InG. Lindzey and E. Aronson. eds.. The Handbook of Social
Psychology. 2d ed. Reading, Mass.: Addison-Wesley. 1968, chap. 15.
Study Suggestions
1. Here are several good examples of survey research. Choose one of them and read the first
chapter (or chapters) to learn the problem of the study. Then go to the technical methodological
appendix to see how the sampling and interviewing were done. (Most published survey research
studies have such appendices.) Try to determine the main variables and their relations.
2. This study suggestion is for students of education. Read as much of the Gross, Mason, and McEachern survey research study (footnote 22) as you can. It is a methodological model for larger-
scale educational research. And, of course, it reports a number of interesting findings about superin-
tendents and boards of education and their views. Perhaps most important, it shows that a scientific
approach and practical concern for educational practice can be combined.
3. Rensis Likert, an outstanding social scientist, a methodological pioneer of survey research,
and the founder of the Institute for Social Research of the University of Michigan (of which the
Survey Research Center is a part), recently died. Two of his colleagues wrote an obituary in which
they described Likert's contributions.²⁴ It is suggested that students read the obituary, which is
virtually an account of the birth and growth of important methodological aspects of survey research,
as well as an interesting description of the contributions of this creative and competent individual.
"S. Seashore and D. Katz, "Obituary; Rensis Likert (1903-1981)," American Psychologist. 37 (1982),
851-853.
PART EIGHT
Measurement
Chapter 25
Foundations of Measurement
"In its broadest sense, measurement is the assignment of numerals to objects or events
according to rules."' This definition succinctly expresses the basic nature of measure-
ment. To understand it, however, requires the definition and explanation of each impor-
tant term — a task to which much of this chapter is devoted.
Suppose we ask a judge to stand seven feet away from a group of students. The judge
is asked to look at the students and then to estimate the degree to which each of them
possesses five attributes: niceness, strength of character, personality, musical ability, and
intelligence. The estimates are to be given numerically with a scale of numbers from 1 to
5, 1 indicating a very small amount of the characteristic in question and 5 indicating a
great deal of it. In other words, the judge, just by looking at the students, is to assess how
"nice"" they are, how "strong"" their characters are, and so on, using the numbers I, 2,
3, 4, and 5 to indicate the amounts of each characteristic they possess.
This example may be a little ridiculous. Most of us, however, go through much the
same procedure all our lives. We often judge how "nice," how "strong," how "intelli-
gent"" people are simply by looking at them and talking to them. It only seems silly when
it is given as a serious example of measurement. Silly or serious, it is an example of
measurement, since it satisfies the definition. The judge assigned numerals to "objects"
according to rules. The objects, the numerals, and the rules for the assignment of the
numerals to the objects were all specified. The numerals were 1, 2, 3, 4, and 5; the objects
were the students; the rules for the assignment of the numerals to the objects were con-
tained in the instructions to the judge. Then the end-product of the work, the numerals,
might be used to calculate measures of relation, analyses of variance, and the like.
The definition of measurement includes no statement about the quality of the measurement procedure. It simply says that, somehow, numerals are assigned to objects or to events. The "somehow," naturally, is important — but not to the definition. Measurement is a game we play with objects and numerals. Games have rules. It is, of course, important for other reasons that the rules be "good" rules, but whether the rules are "good" or "bad," the procedure is still measurement.
Why this emphasis on the definition of measurement and on its "rule" quality? There
are three reasons. First, measurement, especially psychological and educational measure-
ment, is misunderstood. It is not hard to understand certain measurements used in the
natural sciences —
length, weight, and volume, for example. Even measures more re-
moved from common sense can be understood without wrenching elementary intuitive
notions too much. But to understand that the measurement of such characteristics of
individuals and groups as intelligence, aggressiveness, cohesiveness, and anxiety in-
volves basically and essentially the same thinking and general procedure is much harder.
Indeed, many say it cannot be done. Knowing and understanding that measurement is the
assignment of numerals to objects or events by rule, then, helps to erase erroneous and
misleading conceptions of psychological and educational measurement.
Second, the definition tells us that, if rules can be set up on some rational or empirical
basis, measurement of anything is theoretically possible. This greatly widens the scien-
tist's measurement horizons. He will not reject the possibility of measuring some property
because the property is complex and elusive. He understands that measurement is a game
that he may or may not be able to play with this or that property at this time. But he never
rejects the possibility of playing the game, though he may realistically understand its
difficulties.
Third, the definition alerts us to the essential neutral core of measurement and meas-
urement procedures and to the necessity for setting up "good" rules, rules whose virtue
can be empirically tested. No measurement procedure is any better than its rules. The
rules given in the example above were poor. The procedure was a measurement proce-
dure; the definition was satisfied. But it was a poor procedure for reasons that should
become apparent later.
DEFINITION OF MEASUREMENT
To repeat our definition, "measurement is the assignment of numerals to objects or events according to rules." A numeral is a symbol of the form: 1, 2, 3, . . . , or I, II, III, . . . .
[Figure 25.1: a measurement diagram, mapping a set of persons onto the set {0, 1}]
The objects of the other set can be numerals or numbers. In most psychological and educational measurement, numerals and numbers are mapped onto, or assigned to, individuals.²
The most interesting and difficult work of measurement is the rule. A rule is a guide, a method, a command that tells us what to do. A mathematical rule is f, a function; f is a rule for assigning the objects of one set to the objects of another set. In measurement
a rule might say: "Assign the numerals 1 through 5 to individuals according to how nice
they are. If an individual is very, very nice, let the number 5 be assigned to him. If an
individual is not at all nice, let the number 1 be assigned. Assign to individuals between
these limits numbers between the limits." Another rule is one we have already met a
number of times: "If an individual is male, assign him a 1. If an individual is female,
assign her a 0." Of course, we would have to have a prior rule or set of rules defining
male and female.
Assume that we have a set, A, of five persons, three men and two women: a₁, a₃, and a₄ are men; a₂ and a₅ are women. We wish to measure the variable, sex. Assuming we have a prior rule that allows us unambiguously to determine sex, we use the rule given in the preceding paragraph: "If a person is male, assign 1; if female, assign 0." Let 0 and 1 be a set; call it B. Then B = {0, 1}. The measurement diagram is shown in Figure 25.1.
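The rule can also be written directly as a function in a few lines of Python, a minimal sketch of the mapping of Figure 25.1:

    # The domain A: five persons and their (previously determined) sexes.
    A = {"a1": "male", "a2": "female", "a3": "male",
         "a4": "male", "a5": "female"}

    def f(sex):
        """The measurement rule: male -> 1, female -> 0."""
        return 1 if sex == "male" else 0

    # The measurement: each member of A paired with a member of B = {0, 1}.
    measurement = {person: f(sex) for person, sex in A.items()}
    print(measurement)  # {'a1': 1, 'a2': 0, 'a3': 1, 'a4': 1, 'a5': 0}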
This procedure is the same as the one we used in Chapter 5 when discussing relations
and functions. Evidently measurement is a relation. Since, to each member of A, the
domain, one and only one object of B, the range, is assigned, the relation is a function.
Are all measurement procedures functions, then? They are, provided the objects being
measured are considered the domain and the numerals being assigned to, or mapped onto
them, are considered the range.
Here is another way to bring set, relation-function, and measurement ideas together.
Recall that a relation is a set of ordered pairs. So is a function. Any measurement proce-
dure, then, sets up a set of ordered pairs, the first member of each pair being the object
measured, and the second member the numeral assigned to the object according to the
²Usually, in a mapping, the members of the domain are said to be mapped onto members of the range. In
order to preserve consistency with the definition of measurement given above and to be able always to conceive
of the measurement procedure as a function, the mapping has been turned around. This conception of mapping,
furthermore, is consistent with the earlier definition of a function as a rule that assigns to each member of the
domain of a set some one member of the range. The rule tells how the pairs are to be ordered.
[Figure 25.2: a mapping of the domain (children) onto the range (ranks)]
measurement rule, whatever it is. We can thus write a general equation for any measurement procedure:

f = {(x, y); x = an object, y = a numeral}

This is read: "The function, f, or the rule of correspondence, is equal to the set of ordered pairs (x, y) such that x is an object and each corresponding y is a numeral." This is a general rule and will fit any case of measurement.
Let us cite another example to make this discussion more concrete. The events to be measured, the x's, are five children. The numerals, the y's, are the ranks 1, 2, 3, 4, and 5. Assume that f is a rule that instructs a teacher as follows: "Give the rank 1 to the child who has the greatest motivation to do schoolwork. Give the rank 2 to the child who has the next greatest motivation to do schoolwork, and so on to the rank 5, which you should give to the child with the least motivation to do schoolwork." The measurement or the function is diagrammed in Figure 25.2.
Note that f, the rule of correspondence, might have been: "If a child has high motivation for schoolwork, give him a 1, but if a child has low motivation for schoolwork, give him a 0." Then the range becomes {0, 1}. This simply means that the set of five children has been partitioned into two subsets, to each of which will be assigned, by means of f, the numerals 1 and 0. A diagram of this would look like Figure 25.1, with the set A being the five children.
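The teacher's ranking rule can be sketched in Python; the motivation judgments are hypothetical scores standing in for the teacher's observations:

    motivation = {"Ann": 9, "Ben": 4, "Carl": 7, "Dora": 2, "Ed": 6}

    # Rank 1 to the most motivated child, rank 5 to the least motivated.
    ordered = sorted(motivation, key=motivation.get, reverse=True)
    ranks = {child: rank for rank, child in enumerate(ordered, start=1)}
    print(ranks)  # {'Ann': 1, 'Carl': 2, 'Ed': 3, 'Ben': 4, 'Dora': 5}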
It is often difficult to devise clear rules that are "good." Nonetheless, we must always have rules of
some kind in order to measure anything.
[Figure 25.3: "Reality" compared with Measurement]
Suppose the "true" values of persistence run from 1 through 8, whereas your measurement system only encompasses 1 through 7.
While this example is a bit fanciful, it does show in a crude way the nature of the
isomorphism problem. The ultimate question to be asked of any measurement procedure
is: Is the measurement procedure isomorphic to reality? You were not too far off in
measuring persistence. The only trouble is that we rarely discover as simply as this the
degree of correspondence to "reality" of our measurements. In fact, we often do not even
know whether we are measuring what we are trying to measure! Despite this difficulty,
scientists must test, in some manner, the isomorphism with "reality" of the measurement
numbers games they play.
Operational definitions specify the activities or "operations" necessary to measure variables or constructs. A construct is an invented name for a property. Many constructs have been used in previous chapters: authoritarianism, achievement, social class, intelligence, persistence, and so on.⁴ Constructs, commonly and somewhat inaccurately called variables, are defined in two general ways in science: by other constructs and by experimental and measurement procedures. These were earlier called constitutive and operational definitions. An operational definition is necessary in order to measure a property or a construct. This is done by specifying the observations of the behavioral indicants of the properties.
Numerals are assigned to the behavioral indicants of properties. Then, after making
observations of the indicants, the numbers (numerals) are substituted for the indicants and
analyzed statistically. As an example, consider investigators who are working on the
relation between intelligence and honesty. They operationally define intelligence as scores
on an intelligence test. Honesty is operationally defined as observations in a contrived
situation permitting pupils to cheat or not to cheat. The intelligence numerals assigned to pupils can be the total number of items correct on the test, or some other form of score.
The honesty numerals assigned to pupils are the number of times they did not cheat when
they could have cheated. The two sets of numbers may be correlated or otherwise ana-
lyzed. The coefficient of correlation, say, is .55, significant at the .01 level. All this is
fairly straightforward and quite familiar. What is not so straightforward and familiar is
this: if the investigators draw the conclusion that there is a significant positive relation
between intelligence and honesty, they are making a large inferential leap from behavior
indicants in the form of marks on paper and observations of "cheating" behavior to
psychological properties. That they may be mistaken should be quite obvious.
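A minimal sketch of the correlational step; the scores are hypothetical, and statistics.correlation requires Python 3.10 or later:

    from statistics import correlation  # Python 3.10+

    iq_scores = [98, 112, 105, 120, 93, 130, 101, 115]
    honesty = [3, 5, 4, 6, 2, 7, 3, 5]  # times pupil did not cheat

    print("r = %.2f" % correlation(iq_scores, honesty))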
°"
LEVELS OF MEASUREMENT AND SCALING
Levels of measurement, the scales associated with the levels, and the statistics appropriate to the levels are complex, even controversial, problems. The difficulties arise mainly from disagreement over the statistics that can legitimately be used at the different levels of measurement. The Stevens position and definition of measurement cited earlier is a broad view that, with liberal relaxation, is followed in this text.⁵ In the ensuing discussion, we
⁴The concepts or constructs under discussion are also called "latent variables." This is an important expression that is being fruitfully used in what has been called analysis of covariance structures, or so-called causal analysis. A latent variable is a construct, an unobserved variable, that is presumed to underlie varied behaviors, and that is used to "explain" these behaviors. "Verbal ability," "conservatism," and "anxiety," for example, are latent variables. Their use will be explained later in the book when we study factor analysis and analysis of covariance structures.
-"'A more restrictive — yet defensible — position requires that differences between measures be interpretable
measured. "Quantitative," in the view of some experts, means that a
as quantitative differences in the property
difference in magnitude between two attribute values represents a corresponding quantitative difference in the
attributes. See L. Jones, "The Nature of Measurement." In R. Thomdike, ed., Educational Measurement, 2d
ed. Washington, D.C.: American Council on Education. 1971, pp. 335-355. This view, strictly speaking, rules
out, as measurement, nominal and ordinal scales. I believe that actual measurement experience in the behavioral
it does not matter terribly, provided the student
sciences and education justifies a more relaxed position. Again,
understands the general ideas being presented. 1 recommend that the more advanced student read Torgerson's
and Nunnally's fine presentations: W. Torgerson. Theory and Methods of Scaling. New York: Wiley, 1958,
chaps. 1 and 2; J. Nunnally. Psychometric Theory. 2d ed. New York: McGraw-Hill, 1978. chap. 1. An older,
outstanding treatise that has strongly influenced this text is: J. Guilford, Psychometric Methods. 2d ed. New
York: McGraw-Hill. 1954, chap. I The curious student will enjoy the collection of articles on
.
the controversy
published in R, Kirk. ed.. Statistical Issues: A Reader for the Behavioral Sciences. Monterey. Calif.: Brooks/
Cole. 1972. chap. 2. Readers who intend doing research and who
always be faced with measurement
will
problems should carefully and repeatedly read Nunnally's excellent presentation of the problems and their
solution: Nunnally. op. cit.. pp. 24-33.
first consider the fundamental scientific and measurement problem of classification and
enumeration.
The first and most elementary step in any measurement procedure is to define the objects of the universe of discourse. Suppose U, the universal set, is defined as all tenth-grade pupils in a certain high school. Next, the properties of the objects of U
must be defined.
All measurement requires that U be broken down into at least two subsets. The most
elementary form of measurement would be to classify or categorize all the objects as
possessing or not possessing some characteristic. Say this characteristic is maleness. We
break U down into males and nonmales, or males and females. These are of course two
subsets of U, or a partitioning of U. (Recall that partitioning a set consists of breaking it
down into subsets that are mutually exclusive and exhaustive. That is, each set object must
be assigned to one subset and one subset only, and all set objects in U must be so
assigned.)
What we have done is to classify the objects of interest to us. We have put them into
categories: we have partitioned them. The obvious simplicity of this procedure seems to
cause difficulty for students. People spend much of their lives categorizing things, events,
and people. Life could not go on without such categorizing, yet to associate the process
with measurement seems difficult.
After a method of classification has been found, we have in effect a rule for telling
which objects of U go into which classes or subsets or partitions. This rule is used and the
set objects are put into the subsets. Here are the boys; here are the girls. Easy. Here are the
middle-class children; here are the working-class children. Not as easy, but not too hard.
Here are the delinquents; here are the nondelinquents. Harder. Here are the bright ones;
here are the average ones; here are the dull ones. Much harder. Here are the creative ones;
here are the noncreative ones. Very much harder.
After the objects of the universe have been classified into designated subsets, the
members of the sets can be counted. In the dichotomous case, the rule for counting was
given in Chapter 4: If a member of U has the characteristic in question, say maleness, then
assign a 1. If the member does not have the characteristic, then assign a 0. (See Figure
25.1.) When set members are counted in this fashion, all objects of a subset are consid-
ered to be equal to each other and unequal to the members of other subsets.
Nominal Measurement
There are four general levels of measurement: nominal, ordinal, interval, and ratio. These
four levels lead to four kinds of scales. Some writers on the subject admit only ordinal,
interval, and ratio measurement, while others say that all four belong to the measurement
family. We need not be too fussy about this as long as we understand the characteristics of
the different scales and levels.
The rules used to assign numerals to objects define the kind of scale and the level of
measurement. The lowest level of measurement is nominal measurement (see earlier
discussion of categorization). The numbers assigned to objects are numerical without
having a number meaning; they cannot be ordered or added. They are labels much like the
letters used to label sets. If individuals or groups are assigned 1, 2, 3, . . . , these numerals are merely names. For example, baseball and football players are assigned such numbers. Telephones are assigned such numbers. Groups may be given the labels I, II, and III or A₁, A₂, and A₃. We use nominal measurement in our everyday thinking and living.
Ordinal Measurement
Ordinal measurement requires that the objects of a set can be rank-ordered on an opera-
tionally defined characteristic or property. The so-called transitivity postulate must be
satisfied: If a is greater than b, and b is greater than c, then a is greater than c. Other
symbols or words can be substituted for "greater than," for example, "less than," "pre-
cedes," "dominates," and so on. Most measurement in behavioral research depends on
this postulate. It must be possible to assert ordinal or rank-order statements like the one
just used. That is, suppose we have three objects, a, b, and c, and a is greater than b, and
b is greater than c. If we can justifiably say, also, that a is greater than c, then the main
condition for ordinal measurement is satisfied. Be wary, however. A relation may seem to
satisfy the transitivity postulate but may not actually do so. For example, can we always
say: a dominates b, and b dominates c; therefore a dominates c? Think of husband, wife,
and child. Think, too, of the relations "loves," "likes," "is friendly to," or "accepts."
In such cases, transitivity should be demonstrated by the researcher.
The procedure can be generalized in three ways. One, any number of objects of any kind can be measured ordinally simply by extension to a, b, c, . . . , n. (Even though two objects may sometimes be equal, ordinal measurement is still possible.) We simply need to be able to say a > b > c > . . . > n on some property. Two, the relation need not be "greater than"; any relation can be used in which the transitivity postulate is satisfied: a O b might mean "a precedes b" or "a is superior to b."
The numerals assigned to ranked objects are called rank values. Let R equal the set of ranked objects: R = {a > b > . . . > n}. Let R* equal the set of rank values: R* = {1, 2, . . . , n}. We assign the objects of R* to the objects of R as follows: the largest object is
assigned 1, the next in size 2, and so on to the smallest object which is assigned the last
numeral in the particular series. If this procedure is used, the rank values assigned are in
the reverse order. If, for instance, there are five objects, with a the largest, b the next,
through e, the smallest, then:
    Objects    R    R*
    a          1    5
    b          2    4
    c          3    3
    d          4    2
    e          5    1
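A minimal sketch of the two numbering conventions just tabled, assuming five hypothetical objects whose magnitudes are known:

    # Assign ranks (R, 1 = largest) and rank values (R*, running in the
    # reverse order), following the convention shown above. The magnitudes
    # are hypothetical.
    objects = {"a": 50, "b": 40, "c": 30, "d": 20, "e": 10}

    ordered = sorted(objects, key=objects.get, reverse=True)  # largest first
    n = len(ordered)

    R      = {obj: i + 1 for i, obj in enumerate(ordered)}  # 1 = largest
    R_star = {obj: n - i for i, obj in enumerate(ordered)}  # reversed values

    print(R)       # {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
    print(R_star)  # {'a': 5, 'b': 4, 'c': 3, 'd': 2, 'e': 1}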
Rank values indicate only rank order; they carry no assurance that the underlying properties they represent are equally spaced. If two subjects have the ranks 8 and 5 and two other subjects the ranks 6 and 3, we cannot say that the differences between the first and second pairs are equal. There is also no way to know that any individual has none of the property being measured. Rank-order scales are not equal-interval scales, nor do they have absolute zero points.
Interval Measurement

Interval or equal-interval scales possess the characteristics of nominal and ordinal scales, especially the rank-order characteristic. In addition, numerically equal distances on interval scales represent equal distances in the property being measured. Thus, suppose that we had measured four objects on an interval scale and gotten the values 8, 6, 5, and 3. Then we can legitimately say that the difference between the first and third objects in the property measured, 8 - 5 = 3, is equal to the difference between the second and fourth objects, 6 - 3 = 3. Another way to express the equal-interval idea is to say that the intervals can be added and subtracted. An interval scale is assumed as follows:

    1    2    3    4    5
Ratio Measurement

The highest level of measurement is ratio measurement. A ratio scale has, in addition to the characteristics of nominal, ordinal, and interval scales, an absolute or natural zero that has empirical meaning. If a measurement is zero on a ratio scale, then there is a basis for saying that some object has none of the property being measured. Since there is an absolute or natural zero, all arithmetic operations are possible, including multiplication and division. Numbers on the scale indicate the actual amounts of the property being measured. If a ratio scale of achievement existed, then it would be possible to say that a pupil with a scale score of 8 has an achievement twice as great as a pupil with a scale score of 4.
Most of the scales and tests used in psychological and educational measurement approximate interval measurement well enough for practical purposes, as we shall see.
First, consider nominal measurement. When objects are partitioned into two, three, or more categories on the basis of group membership (sex, ethnic identification, married-single, Protestant-Catholic-Jew, and so forth), measurement is nominal. When continuous variables are converted to attributes, as when objects are divided into high-low and old-young, we have what can be called quasi-nominal measurement: although capable of at least rank order, the values are in effect collapsed to 1 and 0.
It is instructive to study the numerical operations that are, in a strict sense, legitimate with each type of measurement. With nominal measurement the counting of numbers of cases in each category and subcategory is, of course, permissible. Frequency statistics like chi square and percentages can be used.
"True" Scale
Ordinal Scale
Figure 25.5
distances are equal (empirically). The situation might be somewhat as indicated in Figure
25.5. The scale on the top ("true" scale) indicates the "true" values of a variable. The
bottom scale (ordinal scale) indicates the rank-order scale used by an investigator. In other
words, an investigator has rank-ordered seven persons quite well, but his ordinal numer-
als, which look equal in interval, are not "true," although they may be fairly accurate
representations of the empirical facts.
Strictly speaking, the statistics that can be used with ordinal scales include rank-order measures such as the rank-order coefficient of correlation, ρ, Kendall's W, rank-order analysis of variance, medians, and percentiles. If only these statistics (and others like them) are legitimate, how can statistics like r, t, and F be used with what are in effect ordinal measures? And they are so used, without a qualm, by most researchers.
Although this is a moot point, the situation is not as difficult as it seems. As Torgerson
points out, some types of natural origin have been devised for certain types of measure-
ment.* In measuring preferences and attitudes, for example, the neutral points (on either
side of which are degrees of positive and negative favoring, approving, liking, and prefer-
ring) can be considered natural origins. Besides, ratio scales, while desirable, are not
absolutely necessary because most of what we need to do in psychological measurement
can be done with equal-interval scales.
The lack of equal intervals is more serious since distances within a scale theoretically
cannot be added without interval equality. Yet, though most psychological scales are
basically ordinal, we can with considerable assurance often assume equality of interval.
The argument is evidential. If we have, say, two or three measures of the same variable,
and these measures are all substantially and linearly related, then equal intervals can be
assumed. This assumption is valid because the more nearly a relation approaches linear-
ity, the more nearly equal are the intervals of the scales. This also applies, at least to some
extent, to certain psychological measures like intelligence, achievement, and attitude tests
and scales.
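One rough way to examine this evidential argument is to compare the product-moment correlation computed on raw scores with the coefficient computed on the ranks alone; when the relation is substantial and nearly linear, the two agree closely. A sketch with hypothetical scores from two measures of the same variable; the helper functions are ad hoc, not from any particular library:

    # Compare raw-score r with the rank-order coefficient (Pearson r on ranks).
    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        return cov / (vx * vy) ** 0.5

    def ranks(x):  # 1 = smallest; ties ignored for simplicity
        order = sorted(range(len(x)), key=lambda i: x[i])
        r = [0] * len(x)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    measure_1 = [12, 15, 19, 22, 30]   # hypothetical scores
    measure_2 = [11, 16, 18, 24, 29]   # a second measure of the same variable

    print(pearson(measure_1, measure_2))                # raw-score r
    print(pearson(ranks(measure_1), ranks(measure_2)))  # rank-order coefficient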
A related argument is that many of the methods of analysis we use work quite well
with most psychological scales. That is, the results we get from using scales and assuming
equal intervals are quite satisfactory.
The point of view adopted in this book is, then, a pragmatic one: the assumption of interval equality works. Still, we are faced with a dilemma: if we use ordinal measures as though they were interval or ratio measures, we can err in interpreting the data and the relations inferred from the data, though the danger is probably not as grave as it has been made out to be. There is no trouble with the numbers, as numbers. They do not know the difference between ρ and r or between parametric and nonparametric statistics, nor do they know the assumptions behind their use. But we do, or should, know the differences and the consequences of ignoring the differences. On the other hand, if we abide strictly by the rules, we cut off powerful modes of measurement and analysis and are left with tools inadequate to cope with the problems we want to solve.
What is the answer, the resolution of the conflict? Part of the answer was given above:
it is probable that most psychological and educational scales approximate interval equality
fairly well. In those situations in which there is serious doubt as to interval equality, there
are technical means for coping with some of the problems. The competent research
worker should know something of scaling methods and certain transformations that change ordinal scales into interval scales.
In the state of measurement at present, we cannot be sure that our measurement
instruments have equal intervals. It is important to ask the question: How serious are the
distortions and errors introduced by treating ordinal measurements as though they were
interval measurements? With care in the construction of measuring instruments, and espe-
cially with care in the interpretation of the results, the consequences are evidently not
serious.
The best procedure would seem to be to treat ordinal measurements as though they were interval measurements, but to be constantly alert to the possibility of gross inequality of intervals. As much as possible about the characteristics of the measuring tools should be learned. Much useful information has been obtained by this approach, with resulting scientific advances in psychology, sociology, and education. In short, it is unlikely that researchers will be led seriously astray by heeding this advice, if they are careful in applying it.
{Note: Study suggestions for the three chapters of Part 8 appear at the end of Chapter
27.)
Again, see Nunnally, op. cit., pp. 24-33, for an enlightened discussion of these and related points.
M. Bartlett, "The Use of Transformations," Biometrics, 3 (1947), 39-52 (especially pp. 49-50); Guilford, op. cit., chap. 8. The subject of transformations and their purposes and uses is an important one, but has not been given the attention it deserves. Two authoritative and informative treatments are: G. Snedecor and W. Cochran, Statistical Methods, 6th ed. Ames, Iowa: Iowa State University Press, 1967, pp. 325ff.; and F. Mosteller and R. Bush, "Selected Quantitative Techniques." In G. Lindzey, ed., Handbook of Social Psychology, vol. I. Cambridge, Mass.: Addison-Wesley, 1954, pp. 324-328.
A useful review of the literature on the problem of scales of measurement and statistics is: P. Gardner, "Scales and Statistics," Review of Educational Research, 45 (1975), 43-57.
Chapter 26
Reliability
After assigning numerals to objects or events according to rules, we must face two
major problems of measurement: reliability and validity. We have devised a measurement
game and have administered the measuring instruments to a group of subjects. We must
now ask and answer the questions: What is the reliability of the measuring instrument?
What is its validity?
If one does not know the reliability and validity of one's data, little faith can be put in
the results obtained and the conclusions drawn from the results. The data of the social
sciences and education, derived from human behavior and human products, are, as we
saw in the last chapter, some steps removed from the properties of scientific interest. Thus
their validity can be questioned. Concern for reliability comes from the necessity for
dependability in measurement. The data of all psychological and educational measure-
ment instruments contain errors of measurement. To the extent that they do so, to that
extent the data they yield will not be dependable.
DEFINITIONS OF RELIABILITY
Synonyms for reliability are: dependability, stability, consistency, predictability, accu-
racy. Reliable people, for instance, are those whose behavior is consistent, dependable, predictable: what they will do tomorrow and next week will be consistent with what they do today and what they did last week. They are stable, we say. Unreliable people, on the other hand, are those whose behavior is much more variable. They are unpredictably variable. Sometimes they do this, sometimes that. They lack stability. We say they are inconsistent.
So it is with psychological and educational measurements; they are more or less
variable from occasion to occasion. They are stable and relatively predictable or they are
unstable and relatively unpredictable; they are consistent or not consistent. If they are
reliable, we can depend upon them. If they are unreliable, we cannot depend upon them.
It is possible to approach the definition of reliability in three ways. One approach is epitomized by the question: If we measure the same set of objects again and again with the same or comparable measuring instrument, will we get the same or similar results? This question implies a definition of reliability in stability, dependability, and predictability terms. It is the definition most often given in elementary discussions of the subject.
A second approach is epitomized by the question: Are the measures obtained from a
measuring instrument the "true" measures of the property measured? This is an accuracy
definition. Compared to the first definition, it is further removed from common sense and
intuition, but it is also more fundamental. These two approaches or definitions can be
summarized in the words stability and accuracy. As we will see later, however, the
accuracy definition implies the stability definition.
There is a third approach to the definition of reliability, an approach that not only helps us better define and solve both theoretical and practical problems but also implies the other approaches and definitions. We can inquire how much error of measurement there is in a measuring instrument. Recall that there are two general types of variance: systematic
and random. Systematic variance leans in one direction; scores tend to be all positive or all
negative or all high or all low. Error in this case is constant or biased. Random or error
variance is self-compensating; scores tend now to lean this way, now that way. Errors of
measurement are random errors. They are the sum of a number of causes: the ordinary
random or chance elements present in all measures due to unknown causes, temporary or
momentary fatigue, fortuitous conditions at a particular time that temporarily affect the
object measured or the measuring instrument, fluctuations of memory or mood, and other
factors that are temporary and shifting. To the extent that errors of measurement are
present in a measuring instrument, to that extent the instrument is unreliable. In other
words, reliability can be defined as the relative absence of errors of measurement in a
measuring instrument.
Reliability is the accuracy or precision of a measuring instrument. A homely example
can easily show what is meant. Suppose a sportsman wishes to compare the accuracy of
two guns. One is an old piece made a century ago but still in good condition. The other is a modern weapon made by an expert gunsmith. Both pieces are solidly fixed in granite
bases and aimed and zeroed in by a sharpshooter. Equal numbers of rounds are fired with
each gun. In Figure 26.1, the hypothetical pattern of shots on a target for each gun is
shown. The target on the left represents the pattern of shots produced by the older gun.
Observe that the shots are considerably scattered. Now observe that the pattern of shots on
the target on the right is more closely packed. The shots are closely clustered around the
bull's-eye.
Let us assume that numbers have been assigned to the circles of the targets: 3 to the bull's-eye, 2 to the next circle, 1 to the outside circle, and 0 to any shot outside the target.
Figure 26.1  [hypothetical shot patterns: scattered for the old gun, closely clustered for the new]

A measure of variability can be calculated for the shots of the older rifle and for those of the newer rifle. These measures can be considered reliability indices. The smaller varia-
bility measure of the new rifle indicates much less error, and thus much greater accuracy.
The new rifle is reliable; the old rifle is less reliable.
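The rifle illustration can be put in computational terms. A sketch with hypothetical shot coordinates, treating the mean squared distance from the bull's-eye as the variability index just described:

    # Less scatter around the bull's-eye means less error, hence a more
    # "reliable" gun. The shot coordinates are hypothetical.
    def scatter(shots):
        """Mean squared distance of shots from the bull's-eye at (0, 0)."""
        return sum(x * x + y * y for x, y in shots) / len(shots)

    old_gun = [(2.0, -1.5), (-1.8, 2.2), (0.5, -2.5), (2.4, 1.9), (-2.1, -0.8)]
    new_gun = [(0.3, -0.2), (-0.4, 0.5), (0.1, -0.6), (0.5, 0.2), (-0.3, -0.1)]

    print(scatter(old_gun))  # large: much error, low reliability
    print(scatter(new_gun))  # small: little error, high reliability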
Similarly, psychological and educational measurements have greater and lesser
reliabilities. A measuring instrument, say an arithmetic achievement test, is given to a
group of children, usually only once. Our goal, of course, is a multiple one: we seek to
hit the "true" score of each child. To the extent that we miss the "true" scores, to this
extent our measuring instrument, our test, is unreliable. The "true," the "real," arithme-
tic scores of five children, say, are 35, 31, 29, 22, 14. Another researcher does not know
these "true" scores. His results are: 37, 30, 26, 24, 15. While he has not in a single case
hit the "true" score, he has achieved the same rank order. His reliability and accuracy are
surprisingly high.
Suppose that his five scores had been: 24, 37, 26, 15, 30. These are the same five
scores, but they have a very different rank order. In this case, the test would be unreliable,
because of its inaccuracy. To show all this more compactly, the three sets of scores, with their rank orders, have been set beside each other in Table 26.1. The rank orders of the first and second columns covary exactly. The rank-order coefficient of correlation is 1.00. Even though the test scores of the second column are not the exact scores, they are in the same rank order. On this basis, using a rank-order coefficient of correlation, the test is reliable. The rank-order coefficient of correlation between the ranks of the first and third columns, however, is zero, so that the latter test is completely unreliable.
Table 26.1 "True." Reliable, and Unreliable Obtained Test Scores and Rank Orders of Five
Children
1 2 3
"True" Scores from Scores from
Scores (Rank) Reliable Test (Rank) Unreliable Test (Rank)
35
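A sketch verifying the two coefficients just cited, applying Spearman's rank-order formula, rho = 1 - 6*sum(d^2)/(n*(n^2 - 1)), to the scores of Table 26.1:

    # Rank 1 = largest score, as in the table; no ties occur in these data.
    def rank(scores):
        order = sorted(scores, reverse=True)
        return [order.index(s) + 1 for s in scores]

    def spearman(x, y):
        rx, ry = rank(x), rank(y)
        n = len(x)
        d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
        return 1 - 6 * d2 / (n * (n * n - 1))

    true_scores     = [35, 31, 29, 22, 14]
    reliable_test   = [37, 30, 26, 24, 15]
    unreliable_test = [24, 37, 26, 15, 30]

    print(spearman(true_scores, reliable_test))    # 1.0
    print(spearman(true_scores, unreliable_test))  # 0.0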
THEORY OF RELIABILITY
The example given in Table 26.1 epitomizes what we need to know about reliability.' It
is necessary, now, to formalize the intuitive notions and to outline a theory of reliability.
This theory is not only conceptually elegant; it is also practically powerful. It helps to
unify measurement ideas and supplies a foundation for understanding various analytic
techniques. The theory also ties in nicely with the variance approach emphasized in earlier
discussions.
Any set of measures has a total variance, that is, after administering an instrument to
a set of objects and obtaining a set of numbers (scores), we can calculate a mean, a
standard deviation, and a variance. Let us be concerned here only with the variance. The
variance, as seen earlier, is a total obtained variance, since it includes variances due to
several causes. In general, any total obtained variance (or sum of squares) includes sys-
tematic and error variances.
Each person has an obtained score, Xt. (The subscript t stands for "total.") This score has two components: a "true" component and an error component. We assume that each person has a "true" score, X∞. (The "∞" is the infinity sign, and is used to signify "true.") This score would be known only to an omniscient being.² In addition to this "true" score, each person has an error score, Xe. The error score is some increment or decrement resulting from several of the factors responsible for errors of measurement. This reasoning leads to a simple equation basic to the theory:

    Xt = X∞ + Xe        (26.1)

which says, succinctly, that any obtained score is made up of two components, a "true" component and an error component. The only part of this definition that gives any real trouble is X∞, which can be conceived to be the score an individual would obtain if all internal and external conditions were "perfect" and the measuring instrument were "perfect." A bit more realistically, it can be considered to be the mean of a large number of administrations of the test to the same person. Symbolically, X∞ = (X1 + X2 + . . . + Xn)/n.³
¹The treatment of reliability in this chapter is based on traditional error theory. See J. Guilford, Psychometric Methods, 2d ed. New York: McGraw-Hill, 1954, chaps. 13 and 14. While this theory has been shown to have unnecessary assumptions, it is admirably suited to conveying to the beginning student the basic nature of reliability. For a criticism of the theory, see R. Tryon, "Reliability and Behavior Domain Validity: Reformulation and Historical Critique," Psychological Bulletin, 54 (1957), 229-249. In practice, the two approaches arrive at the same formulas. The most recent development of reliability theory and practice, called generalizability theory, emphasizes multivariate (or multifacet) thinking, components of variance analysis, and decision making. An extended discussion of the theory is given in: L. Cronbach, G. Gleser, H. Nanda, and N. Rajaratnam, The Dependability of Behavioral Measurement: Theory of Generalizability for Scores and Profiles. New York: Wiley, 1972. (Also see footnote 3, below.)
²This does not mean that X∞ may not include properties other than the property being measured. All systematic variance is included in X∞. The problem of measuring the property is a validity problem.
³It must be emphasized that the notion of a "true" score is a fiction, albeit a useful fiction. Kaplan calls it "the fiction of the true measure." A. Kaplan, The Conduct of Inquiry. San Francisco: Chandler, 1964, p. 202. He points out that the "true measure" is a limit, much as in the calculus, toward which the measures converge (p. 203). Whenever "true score," or X∞, is used, then, it is understood that the expression is a convenient fiction. Lord and Novick, in their authoritative book on test score theory, define true score as the expected value of an observed score, which can be interpreted as the average score an individual "would obtain on infinitely many independent repeated measurements (an unobservable quantity)." F. Lord and M. Novick, Statistical Theories of Mental Test Scores. Reading, Mass.: Addison-Wesley, 1968, pp. 30-31. Similar qualifications are to be understood with notions like "true variance" and the correlation between obtained scores and true scores (see below).
With a little simple algebra, Equation 26.1 can be extended to yield a more useful equation in variance terms:

    Vt = V∞ + Ve        (26.2)

Equation 26.2 shows that the total obtained variance of a test is made up of two variance components, a "true" component and an "error" component. If, for example, it were possible to administer the same instrument to the same group 4,367,929 times, and then to calculate the mean of each person's 4,367,929 scores, we would have a set of "nearly true" measures of the group. In other words, these means are the X∞'s of the group. We could then calculate the variance of the X∞'s, yielding V∞. This value must always be less than Vt, the variance calculated from the obtained set of original scores, the Xt's, because the original scores contain error, whereas the "true," or "nearly true," scores have no error, the error having been washed out by the averaging process. Put differently, if there were no errors of measurement in the Xt's, then Vt = V∞. But there are always errors of measurement, and we assume that if we knew the error scores and subtracted them from the obtained scores we would obtain the "true" scores.
We never know the "true" scores nor do we really ever know the error scores.
Nevertheless, it is possible to estimate the error variance. By so doing, we can, in effect,
substitute in Equation 26.2 and solve the equation. This is the essence of the idea, even
though certain assumptions and steps have been omitted from the discussion. A diagram
or two may show the ideas more clearly. Let the total variances of two tests be represented
by two bars. One test is highly reliable; the other test only moderately so, as shown in
Figure 26.2. Tests A and B have the same total variance, but 90 percent of Test A is
"true" variance and 10 percent is error variance. Only 60 percent of Test B is "true"
variance and 40 percent is error variance. Test A is thus much more reliable than Test B.
Reliability is defined, so to speak, through error: the more error, the greater the unreliability; the less error, the greater the reliability. Practically speaking, this means that if we can estimate the error variance of a measure we can also estimate the measure's reliability. This brings us to two equivalent definitions of reliability:
1. Reliability is the proportion of the "true" variance to the total obtained variance of the data
yielded by a measuring instrument.
2. Reliability is the proportion of error variance to the total variance yielded by a measuring
instrument subtracted from 1.00, the index 1.00 indicating perfect reliability.
Figure 26.2  [Total variance bars: Test A, 90 percent "true" variance and 10 percent error variance; Test B, 60 percent "true" variance and 40 percent error variance]
These definitions can be written in equation form:

    rtt = V∞ / Vt        (26.3)

    rtt = 1 - Ve / Vt        (26.4)

where rtt is the reliability coefficient and the other symbols are as defined before. Equation 26.3 is theoretical and cannot be used for calculation. Equation 26.4 is both theoretical and practical. It can be used both to conceptualize the idea of reliability and to estimate the reliability of an instrument. An alternate equation to (26.4) is:

    rtt = (Vt - Ve) / Vt        (26.5)
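A tiny numerical sketch of Equations 26.3 through 26.5. The variance figures are hypothetical; in practice Ve must be estimated, for example by the analysis of variance described below:

    # Hypothetical variance components; the three expressions agree.
    V_t = 100.0            # total obtained variance
    V_e = 10.0             # estimated error variance
    V_inf = V_t - V_e      # "true" variance, by Equation 26.2

    r_tt_26_3 = V_inf / V_t          # Equation 26.3 (theoretical)
    r_tt_26_4 = 1 - V_e / V_t        # Equation 26.4
    r_tt_26_5 = (V_t - V_e) / V_t    # Equation 26.5

    print(r_tt_26_3, r_tt_26_4, r_tt_26_5)  # 0.9 0.9 0.9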
To show the nature of reliability, two examples are given in Table 26.2. One of them,
labeled I in the table, is an example of high reliability; the other, labeled II, is an example
of low reliability. Note carefully that exactly the same numbers are used in both cases.
The only difference is that they are arranged differently. The situation in both cases is this:
five individuals have been administered a test of four items. (This is unrealistic, of course,
but it will do to illustrate several points.) The data of the five individuals are given in the
rows; the sums of the individuals are given to the right of the rows (Σt). The sums of the items are given at the bottom of each table. In addition, the sums of the individuals on the odd items (Σo) and the sums of the individuals on the even items (Σe) are given on the extreme right of each subtable. The calculations necessary for two-way analyses of
variance are given below the data tables.
To make the examples more realistic, imagine that the data are scores on a six-point scale, say attitudes toward school. A high score means a highly favorable attitude, a low score a less favorable (or unfavorable) attitude. (It makes no difference, however, what the scores are. They can even be 1's and 0's resulting from marking items of an achievement test: right = 1, and wrong = 0.) In I, Individual 1 has a high favorable attitude
toward school, whereas Individual 5 has a low favorable attitude toward school. These are
readily indicated by the sums of the individuals (or the means): 21 and 5. These sums (Σt) are the usual scores yielded by tests. For instance, if we wanted to know the mean of the group, we would calculate it as (21 + 18 + 14 + 10 + 5)/5 = 13.60.
The variance of these sums provides one of the terms of Equations 26.4 and 26.5, but not the other: Vt but not Ve. By using the analysis of variance it is possible to calculate both Vt and Ve. The analyses of variance of I and II show how this is done. These calculations need not detain us long, since they are subsidiary to the main issue.
The analysis of variance yields the variances: Between items, Between individuals, and Residual or Error. The F ratios for Items are not significant in I or II. (Note that both mean squares are 2.27. Obviously they must be equal, since they are calculated from the same sums at the bottoms of the two subtables.) Actually, we are not interested in these variances; we only want to remove the variance due to items from the total variance. Our interest lies in the Individual variances and in the Error variances, which are circled in the subtables. In this application the total variance of Equations 26.3, 26.4, and 26.5 is the Between-individuals variance, Vind. The reliability coefficient rtt works out to .92 for the data of I and .45 for the data of II. The hypothetical data of I are reliable; those of II are not reliable.
Perhaps the best way to understand this is to go back to Equation 26.3. Now we write rtt = V∞/Vind. If we had a direct way to calculate V∞, we could quickly calculate rtt, but as we saw before, we do not have a direct way. There is a way to estimate it, however. If we can find a way to estimate Ve, the error variance, the problem is solved, because Ve can be subtracted from Vind to yield an estimate of V∞. Obviously we can ignore V∞ and subtract the proportion Ve/Vind from 1 and get rtt. This is a perfectly acceptable way to calculate rtt and to conceptualize reliability. Reasoning from Vind - Ve is perhaps more fruitful and ties in nicely with our earlier discussion of components of variance.
It was said in Chapter 13 that each statistical problem has a total amount of variance
and each variance source contributes to this total variance. We translate the reasoning of
Chapter 13 to the present problem. In random samples of the same population, Vb and Vw should be statistically equal. But if Vb, the between-groups variance, is significantly greater than Vw, the within-groups (error) variance, then there is something in Vb over and above chance. That is, Vb includes the variance of Vw and, in addition, some systematic variance.
Similarly, we can say that if Vind is significantly greater than Ve, then there is something in Vind over and above error variance. This excess of variance would seem to be due to individual differences in whatever is being measured. Measurement aims at the "true" scores of individuals. When we say that reliability is the accuracy of a measuring instrument, we mean that a reliable instrument more or less measures the "true" scores of individuals, the "more or less" depending on the reliability of the instrument. That "true" scores are measured can be inferred only from the "true" differences between individuals, although neither of these can be directly measured, of course. What we do is to infer the "true" differences from the fallible, empirical, measured differences, which are always to some extent corrupted by errors of measurement.
Now, if there is some way to remove from Vind the effect of errors of measurement, some way to free Vind of error, we can solve the problem easily. We simply subtract Ve from Vind to get an estimate of V∞. Then the proportion of the "pure" variance to all the variance, "pure" and "impure," is the estimate of the reliability of the measuring instrument. To summarize symbolically:

    rtt = V∞ / Vind = (Vind - Ve) / Vind = 1 - Ve / Vind
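A sketch of this analysis-of-variance estimate. The persons-by-items matrix is hypothetical, constructed only so that its row sums match those of example I (21, 18, 14, 10, 5); it is not the data of Table 26.2, though the estimate it yields happens to fall near the .92 reported for I:

    # Two-way breakdown of a persons-by-items matrix (Hoyt's approach):
    # r_tt = (MS_individuals - MS_residual) / MS_individuals.
    data = [  # rows = individuals, columns = items a, b, c, d (hypothetical)
        [6, 6, 5, 4],
        [4, 6, 5, 3],
        [4, 2, 4, 4],
        [2, 3, 3, 2],
        [1, 2, 1, 1],
    ]
    n, k = len(data), len(data[0])
    total = sum(sum(row) for row in data)
    C = total ** 2 / (n * k)                              # correction term

    ss_total = sum(x * x for row in data for x in row) - C
    ss_ind   = sum(sum(row) ** 2 for row in data) / k - C
    ss_items = sum(sum(col) ** 2 for col in zip(*data)) / n - C
    ss_res   = ss_total - ss_ind - ss_items               # residual (error)

    ms_ind = ss_ind / (n - 1)
    ms_res = ss_res / ((n - 1) * (k - 1))

    r_tt = (ms_ind - ms_res) / ms_ind
    print(round(r_tt, 2))   # 0.92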
In I the range of the sums of the individuals is 21 - 5 = 16; in II it is 16 - 6 = 10. Given the same individuals, the more reliable a measure the greater the range of the sums of the individuals. Think of the extreme: a completely unreliable instrument would yield sums that are like the sums yielded by random numbers, and, of course, the reliability of random numbers is approximately zero. (The nonsignificant F ratio for Individuals, 1.81, in II indicates that rtt = .45 is not statistically significant.)
Now examine the rank orders of the values under the items, a, b, c, and d. In I, all four rank orders are about the same. Each item of the attitude scale, apparently, is measuring the same thing. To the extent that the individual items yield the same rank orders of individuals, to this extent the test is reliable. The items hang together, so to speak. They are internally consistent. Also, notice that the rank orders of the items of I are about the same as the rank order of the sums of the individuals.
The rank orders of the item values of II are quite different. The rank orders of a and c
agree very well; they are the same as those of I. The rank orders of a and b, a and d, b
and d, and c and d. however, do not agree very well. Either the items are measuring
different things, or they are not measuring very consistently. This lack of congruence of rank orders is reflected in the totals of the individuals. Although the rank order of these totals is similar to the rank order of the totals of I, the range or variance is considerably less, and there is a lack of spread between the sums (for example, the three 16's).
We conclude our consideration of these two examples by considering certain figures in Table 26.2 not considered before. On the right-hand side of both I and II the sums of the odd items (Σo) and the sums of the even items (Σe) are given. Simply add the values of the odd items across the rows, a + c: 6 + 5 = 11, 4 + 5 = 9, 4 + 4 = 8, and so forth, in I. Then add the values of the even items, b + d: 6 + 4 = 10, 6 + 3 = 9, and so forth, in I also. If there were more items, for example, a, b, c, d, e, f, g, then we would add a + c + e + g for the odd sums, and b + d + f for the even sums. To calculate the reliability coefficient, calculate the product-moment correlation between the odd sums and the even sums, and then correct the resulting coefficient with the Spearman-Brown formula.⁴ The odd-even rtt's for I and II are .91 and .32, respectively, fairly close to the analysis of variance results of .92 and .45. (With more subjects and more items, the estimates will ordinarily be close.)
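A sketch of the odd-even procedure with the Spearman-Brown step-up, rtt = 2r/(1 + r). The odd and even sums are taken from the hypothetical matrix of the preceding sketch, not from Table 26.2:

    # Correlate odd-item sums with even-item sums, then correct the
    # half-test coefficient for test length with Spearman-Brown.
    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        return cov / (vx * vy) ** 0.5

    odd_sums  = [11, 9, 8, 5, 2]   # a + c for each individual
    even_sums = [10, 9, 6, 5, 3]   # b + d for each individual

    r_half = pearson(odd_sums, even_sums)
    r_tt = 2 * r_half / (1 + r_half)      # Spearman-Brown correction
    print(round(r_half, 2), round(r_tt, 2))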
This simple operation may seem mystifying. To see that it is a variation of the same variance and rank-order theme, note, first, the rank orders of the sums in the two examples. The rank orders of Σo and Σe are almost the same in I, but quite different in II. The reasoning is the same as before. Evidently the items are measuring the same thing in I, but in II the two sets of items are not consistent. To reconstruct the variance argument, remember that by adding the sum of the odd items to the sum of the even items for each person the total sum, Σo + Σe = Σt, is obtained.
A more theoretical expression of the reliability coefficient is the square of the correlation between obtained scores and "true" scores:

    rtt = r²t∞        (26.6)

Although it is not possible to calculate rt∞ directly, it is helpful to understand the rationale of the reliability coefficient in these theoretical terms.
Another theoretical interpretation is to conceive that each X∞ can be the mean of a
"See any measurement text, for example. A. Anastasi, Psychological Testing. 4th ed. New York: Macmil-
lan. 1976, pp. 15-116. The sums of the odd and the sums of the even items are, of course, the sums of only
1
They are therefore less reliable than the sums of all the items. The Spearman-Brown
half the items in a test.
formula corrects the odd-even coefficient (and other part coefficients) for the lesser number of items used in
calculating the coefficient.
large number of Xt's derived from administering the test to an individual a large number of times, other things being equal. The idea behind this notion has been explained before.
The first administration of the test yields, say, a certain rank order of individuals. If the
second, third, and further measurings all tend to yield approximately the same rank order,
then the test is reliable. This is a stability or test-retest interpretation of reliability.
Another interpretation is that reliability is the internal consistency of a test; the test
items are homogeneous. This interpretation in effect boils down to the same idea as other
interpretations: accuracy. Take any random sample of items from the test, and any other random and different sample of items from the test. Treat each sample as a separate subtest. Each individual will then have two scores: one Xt for one subsample and another Xt for the other subsample. Correlate the two sets, continuing the process indefinitely. The average intercorrelation of the subsamples (corrected by the Spearman-Brown formula) shows the test's internal consistency.⁵ But this means, really, that each subsample, if the test is reliable, succeeds in producing approximately the same rank order of individuals. If it does not, the test is to that extent unreliable.
Two important aspects of reliability are the reliability of means and the reliability of
individual measures. These are tied to the standard error of the mean and the standard
error of measurement. In research studies, ordinarily, the standard error of the mean (and related statistics like the standard error of the differences between means and the standard error of a correlation coefficient) is the more important of these. Since the
standard error of the mean was discussed in considerable detail in an earlier chapter, it is
only necessary to say here that the reliability of specific statistics is another aspect of the
general problem of reliability. The standard error of measurement, or its square, the
standard variance of measurement, needs to be defined and identified, if only briefly. This
will be done through use of a simple example.
An investigator measures the attitudes of five individuals and obtains the scores given under the column labeled Xt in Table 26.3. Assume, further, that the "true" attitude scores of the five individuals are those given under the column labeled X∞. (Remember, however, that we can never know these scores.) It can be seen that the instrument is reliable. While only one of the five obtained scores is exactly the same as its companion "true" score, the differences between those obtained scores that are not the same and the "true" scores are all small. These differences are shown under the column labeled Xe; they are "error scores." The instrument is evidently fairly accurate. The calculation of rtt
⁵See L. Cronbach, "Coefficient Alpha and the Internal Structure of Tests," Psychometrika, 16 (1951), 297-334; Tryon, op. cit. The formulas given by Cronbach and Tryon look different from Equations 26.3 and 26.4. They yield the same results, however. The originator of the use of the analysis of variance to estimate reliability seems to have been Hoyt. See C. Hoyt, "Test Reliability Obtained by Analysis of Variance," Psychometrika, 6 (1941), 153-160. Ebel extended the use of analysis of variance to ratings and stressed the use of the intraclass coefficient of correlation: R. Ebel, "Estimation of the Reliability of Ratings," Psychometrika, 16 (1951), 407-424.
[Table 26.3, giving the obtained scores (Xt), "true" scores (X∞), and error scores (Xe) of the five individuals, appeared here.]
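A minimal sketch of the standard error of measurement, using the standard formula SEmeas = st * sqrt(1 - rtt); the obtained scores and the reliability figure are hypothetical, not those of Table 26.3:

    # The standard error of measurement estimates the typical size of an
    # error of measurement in individual scores.
    import math

    def sd(x):
        m = sum(x) / len(x)
        return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))

    obtained = [9, 8, 7, 6, 5]   # hypothetical obtained scores
    r_tt = 0.90                  # hypothetical reliability coefficient

    se_meas = sd(obtained) * math.sqrt(1 - r_tt)
    print(round(se_meas, 2))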
Is a coefficient of correlation between two variables low because one or both measures are unreliable? Is an analysis of variance F ratio not significant because the hypothesized relation does not exist or because the measure of the dependent variable is unreliable?
Reliability, while not the most important facet of measurement, is still extremely
important. In a way, this is like the money problem: the lack of it is the real problem. High
reliability is no guarantee of good scientific results, but there can be no good scientific
results without reliability. In brief, reliability is a necessary but not sufficient condition of
the value of research results and their interpretation.
Chapter 27
Validity
It is possible to study reliability without inquiring into the meaning of variables. It is not
possible to study validity, however, without sooner or later inquiring into the nature and
meaning of one's variables.
When measuring certain physical properties and relatively simple attributes of per-
sons, validity is no great problem. There is often rather direct and close congruence
between the nature of the object measured and the measuring instrument. The length of an
object, for example, can be measured by laying sticks, containing a standard number
system in feet or meters, on the object. Weight is more indirect, but nevertheless not difficult: an object placed in a container displaces the container downward. The down-
ward movement of the container is registered on a calibrated index, which reads
"pounds" or "ounces." With some physical attributes, then, there is little doubt of what
is being measured.
On the other hand, suppose an educational scientist wishes to study the relation be-
tween intelligence and school achievement or the relation between authoritarianism and
teaching style. Now there are no rulers to use, no scales with which to weigh the degree of
authoritarianism, no clear-cut physical or behavioral attributes that point unmistakably to
teaching style. It is necessary in such cases to invent indirect means to measure psycho-
logical and educational properties. These means are often so indirect that the validity of
the measurement and its products is doubtful.
TYPES OF VALIDITY
The commonest definition of validity is epitomized by the question: Are we measuring what we think we are measuring? The emphasis in this question is on what is being measured. For example, a teacher has constructed a test to measure understanding of scientific procedures and has included in the test only factual items about scientific procedures. The test is not valid, because while it may reliably measure the pupils' factual knowledge of scientific procedures, it does not measure their understanding of such procedures. In other words, it may measure what it measures quite well, but it does not measure what the teacher intended it to measure.
Although the commonest definition of validity was given above, it must immediately
be emphasized that there is no one validity. A test or scale is valid for the scientific or
practical purpose of its user. Educators may be interested in the nature of high school
pupils' achievement in mathematics. They would then be interested in what a mathematics
achievement or aptitude test measures. They might, for instance, want to know the factors
that enter into mathematics test performance and their relative contributions to this perfor-
mance. On the other hand, they may be primarily interested in knowing the pupils who
will probably be successful and those who will probably be unsuccessful in high school
mathematics. They may have little interest in what a mathematics aptitude test measures.
They are interested mainly in successful prediction. Implied by these two uses of tests are
different kinds of validity. We now examine an extremely important development in test
theory: the analysis and study of different kinds of validity.
The most important classification of types of validity is that prepared by a joint
committee of the American Psychological Association, the American Educational Re-
search Association, and the National Council on Measurements Used in Education.¹
Three types of validity are discussed: content, criterion-related, and construct. Each of
these will be examined briefly, though we put the greatest emphasis on construct validity,
since it is probably the most important form of validity from the scientific research point
of view.
Content Validity

A university psychology professor has given a course to seniors in which she has emphasized the understanding of principles of human development. She prepares an objective-type test. Wanting to know something of its validity, she critically examines each of the test's items for its relevance to understanding principles of human development. She also asks two colleagues to evaluate the content of the test. Naturally, she tells the colleagues what it is she is trying to measure. She is investigating the content validity of the test.
Content validity is the representativeness or sampling adequacy of the content (the substance, the matter, the topic) of a measuring instrument. Content validation is guided by the question: Is the substance or content of this measure representative of the content or the universe of content of the property being measured? Any psychological or educational property has a theoretical universe of content consisting of all the things that can possibly be said or observed about the property. The members of this universe, U, can be called
¹Standards for Educational and Psychological Tests. Washington, D.C.: American Psychological Association, 1974. An important article that explains in detail the system and thinking of the committee in relation to validity is: L. Cronbach and P. Meehl, "Construct Validity of Psychological Tests," Psychological Bulletin, 52 (1955), 281-302. A detailed and definitive more recent statement is: L. Cronbach, "Test Validation." In R. Thorndike, ed., Educational Measurement, 2d ed. Washington, D.C.: American Council on Education, 1971, pp. 443-507.
items. The property might be "arithmetic achievement," to take a relatively easy example. U has an infinite number of members: all possible items using numbers, arithmetic
operations, and concepts. A test high in content validity would theoretically be a repre-
sentative sample of U. If it were possible to draw items from U at random in sufficient
numbers, then any such sample of items would presumably form a test high in content
validity. If U consists of subsets A, B, and C, which are arithmetic operations, arithmetic
concepts, and number manipulation, respectively, then any sufficiently large sample of U
would represent A, B, and C approximately equally. The test's content validity would be
satisfactory.
Ordinarily, and unfortunately, it is not possible to draw random samples of items from
a universe of content. Such universes exist only theoretically. True, it is possible and
desirable to assemble large collections of items, especially in the achievement area, and to
draw random samples from the collections for testing purposes. But the content validity of
such collections, no matter how large and how "good" the items, is always in question.
If it is not possible to satisfy the definition of content validity, how can a reasonable degree of content validity be attained? While the content of most achievement tests is "self-validated" in the sense that the
individual writing the test to a degree defines the property being measured (for example, a
teacher writing a classroom test of spelling or arithmetic), it is dangerous to assume the
adequacy of content validity without systematic efforts to check the assumption. For
example, an educational investigator, testing hypotheses about the relations between so-
cial studies achievement and other variables, may assume the content validity of a social
studies test. The theory from which the hypotheses were derived, however, may require
understanding and application of social studies ideas, whereas the test used may be
almost purely factual in content. The test lacks content validity for the purpose. In fact,
the investigator is not really testing the stated hypotheses.
Content validation, then, is basically judgmental. The items of a test must be studied, each item being weighed for its presumed representativeness of the universe. This means that each item must be judged for its presumed relevance to the property being measured, which is no easy task. Usually other "competent" judges should judge the content of the
items. The universe of content must, if possible, be clearly defined; that is, the judges
must be furnished with specific directions for making judgments, as well as with specifi-
cation of what they are judging. Then, some method for pooling independent judgments
can be used.²
²An excellent guide to the content validity of achievement tests is: B. Bloom, ed., Taxonomy of Educational Objectives. Handbook I: Cognitive Domain. New York: David McKay, 1956. This is a comprehensive attempt to outline and discuss educational goals in relation to measurement.
Criterion-Related Validity

Criterion-related validity is studied by comparing test or scale scores with one or more external variables, or criteria, known or believed to measure the attribute under study. When one predicts success or failure of students from academic aptitude measures, one is concerned with criterion-related validity. How well does the test (or tests) predict to graduation or to grade-point average?³ One does not care so much what the test measures as one cares for its predictive ability. In fact, in criterion-related validation, which is often practical and applied research, the basic interest is usually more in the criterion, some practical outcome, than in the predictors. (In basic research this is not so.) The higher the correlation between a measure or measures of academic aptitude and the criterion, say grade-point average, the better the validity. In short and again, the emphasis is on the criterion and its prediction.⁴
The word prediction is usually associated with the future. This is unfortunate because,
in science, prediction does not necessarily mean forecast. One "predicts" from an independent variable to a dependent variable. Criterion-related validation, that is, checks a measurement, either now or in the future, against some outcome or measure. In a sense, all tests
are predictive; they "predict" a certain kind of outcome, some present or future state of
affairs. Aptitude tests predict future achievement; achievement tests predict present and
future achievement and competence; and intelligence tests predict present and future
ability to learn and to solve problems. Even if we measure self-concept, we predict that if
the self-concept score is so-and-so, then the individual will be such-and-such now or in
the future.
The single greatest difficulty of criterion-related validation is the criterion; obtaining criteria may even be difficult. What criterion can be used to validate a measure of teacher effectiveness? Who is to judge teacher effectiveness? What criterion can be used to test the predictive validity of a musical aptitude test?
"To make a decision, one predicts the person's success under each treatment and uses a
rule to translate the prediction into an assignment.
^
'
A
test high in criterion-related validity
'
is one that helps investigators make successful decisions in assigning people to treatments,
³Criterion-related validity used to be called predictive validity. A related term is concurrent validity, which differs from predictive validity in the time dimension: the criterion is measured at about the same time as the predictor. In this sense, the test serves to assess the present status of individuals.
⁴For a discussion of desirable qualities of a criterion, see R. Thorndike and E. Hagen, Measurement and Evaluation in Psychology and Education, 4th ed. New York: Wiley, 1977, pp. 61-64.
⁵Cronbach, op. cit., p. 484.
Both multiple predictors and multiple criteria can be and are used. Later, when we study
multiple regression, we will focus on multiple predictors and how to handle them statisti-
cally. Multiple criteria can be handled separately or together, though it is not easy to do
the latter. In practical research, a decision must usually be made. If there is more than one
criterion, how can we best combine them for decision-making? The relative importance of
the criteria, of course, must be considered. Do we want an administrator high in problem-
solving ability, high in public relations ability, or both? Which is more important in the
particular job? It is highly likely that the use of both multiple predictors and multiple
criteria will become common as multivariate methods become better understood and the
computer is used routinely in prediction research.
Construct Validity

Construct validity is one of the most significant scientific advances of modern measurement theory and practice. It is a significant advance because it links psychometric notions and practices to theoretical notions. Scientists who investigate the construct validity of a test want to know what lies behind test performance. They ask: What factors or constructs account for variance in test performance? Does this test measure verbal ability and abstract reasoning ability? Does it also "measure" social class membership? They ask, for example, what proportions of the total test variance are accounted for by the constructs verbal ability, abstract reasoning ability,
and social class membership. In short, they seek to explain individual differences in test
scores. Their interest is usually more in the properties being measured than in the tests
used to accomplish the measurement.
Researchers generally start with the constructs or variables entering into relations.
Suppose that a researcher has discovered a positive correlation between two measures,
one a measure of educational traditionalism and the other a measure of the perception of
the characteristics associated with the "good" teacher. Individuals high on the tradition-
alism measure see the "good" teacher as efficient, moral, thorough, industrious, consci-
entious, and reliable. Individuals low on the traditionalism measure may see the "good"
teacher in a different way. The researcher now wants to know why this relation exists,
what is behind it. To learn why, the meaning of the constructs entering the relation,
"perception of the "good" teacher" and "traditionalism." must be studied. How to study
these meanings is a construct validity problem.*'
One can see that construct validation and empirical scientific inquiry are closely al-
lied. It is not simply a question of validating a test. One must try to validate the theory
behind the test. Cronbach says that there are three parts to construct validation: suggesting
what constructs possibly account for test performance, deriving hypotheses from the
theory involving the construct, and testing the hypotheses empirically.⁷ This formulation is but a précis of the general scientific approach discussed in Part One.
The significant point about construct validity, that which sets it apart from other types of validity, is its preoccupation with theory, theoretical constructs, and scientific empirical inquiry involving the testing of hypothesized relations. Construct validation in measurement contrasts sharply with approaches that define the validity of a measure primarily by its success in predicting a criterion. For example, a purely empirical tester might say that a test is valid if it efficiently distinguishes individuals high and low in a trait. Why the test succeeds in separating the subsets of a group is of no great concern. It is enough that it does.

⁶This example was taken from the following research: F. Kerlinger and E. Pedhazur, "Educational Attitudes and Perceptions of Desirable Traits of Teachers," American Educational Research Journal, 5 (1968), 543-560.
⁷L. Cronbach, Essentials of Psychological Testing, 3d ed. New York: Harper & Row, 1970, p. 143.
Note that the testing of alternative hypotheses is particularly important in construct valida-
tion because both convergence and discriminability are required. Convergence means that
evidence from different sources gathered in different ways all indicates the same or similar
meaning of the construct. Different methods of measurement should converge on the
construct. The evidence yielded by administering the measuring instrument to different
groups in different places should yield similar meanings or, if not, should account for
differences. A measure of the self-concept of children, for instance, should be capable of
similar interpretation in different parts of the country. If it is not capable of such interpre-
tation in some locality, the theory should be able to explain why — indeed, it should
predict such a difference.
Discriminability means that one can empirically differentiate the construct from other constructs that may be similar, and that one can point out what is unrelated to the construct. We point out, in other words, what other variables are correlated with the construct and how they are so correlated. But we also indicate what variables should be uncorrelated with the construct. We point out, for example, that a scale to measure conservatism should and does correlate substantially with measures of authoritarianism and rigidity (the theory predicts this) but not with measures of social desirability.⁸ Let us illustrate these ideas.
Let us assume that an investigator is interested in the determinants of creativity and the
relation of creativity to school achievement. He notices that the most sociable persons,
who exhibit affection for others, also seem to be less creative than those who are less
sociable and affectionate. He wants to test the implied relation in a controlled fashion.
One of his first tasks is to obtain or construct a measure of the sociable-affectionate characteristic. The investigator, surmising that this combination of traits may be a reflection of a deeper concern of love for others, calls it amorism. He assumes that there are
individual differences in amorism, that some people have a great deal of it, others a
moderate amount, and still others very little.
He must first construct an instrument to measure amorism. The literature gives little
help, since scientific psychologists have rarely investigated the fundamental nature of
love. Sociability, however, has been measured. The investigator must construct a new
instrument, basing its content on intuitive and reasoned notions of what amorism is. The
reliability of the test, tried out with large groups, runs between .75 and .85.
The question now is whether the test is valid. The investigator correlates the instru-
ment, calling it the A scale, with independent measures of sociability. The correlations are
moderately substantial, but he needs evidence that the test has construct validity. He
deduces certain relations that should and should not exist between amorism and other
"See F. Kerlinger, "A Social Attitude Scale: Evidence on Reliability and Validity." Psychological Re-
ports, 26 (1970). 379-383.
variables. If amorism is a general tendency to love others, then it should correlate with
characteristics like cooperativeness and friendliness. Persons high in amorism will ap-
proach problems in an ego-oriented manner as contrasted to persons low in amorism, who
will approach problems in a task-oriented manner.
Acting on this reasoning, the investigator administers the A scale and a scale to
measure subjectivity to a number of tenth-grade students. To measure cooperativeness he
observes the classroom behavior of the same group of students. The correlations between
the three measures are positive and significant.9
Knowing the pitfalls of psychological measurement, the investigator is not satisfied.
These positive correlations may be due to a factor common to all three tests, but irrelevant
to amorism; for example, the tendency to give "right" answers. (This would probably be
ruled out, however, because the observation measure of cooperativeness correlates posi-
tively with amorism and subjectivity.) So, taking a new group of subjects, he administers
the amorism and subjectivity scales, has the subjects' behavior rated for cooperativeness,
and in addition, administers a creativity test that has been found in other research to be
reliable.
The investigator states the relation between amorism and creativity in hypothesis
form: The relation between the A scale and the creativity measure will be negative and
significant. The correlations between amorism and cooperativeness and between amorism
and subjectivity will be positive and significant. "Check" hypotheses are also formu-
lated: The correlation between cooperativeness and creativity will not be significant; it
will be near zero, but the correlation between subjectivity and creativity will be positive
and significant. This last relation is predicted on the basis of previous research findings.
The six correlation coefficients are given in the correlation matrix of Table 27.1.

Table 27.1
Although cooperativeness was expected to correlate with amorism, there was no theoretical
reason to expect it to correlate at all with creativity.
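To make the logic of these predicted-pattern checks concrete, here is a small computational sketch in Python. It is not from the original study: the variable names, target correlations, and data are all invented for illustration, and scipy's pearsonr is used to obtain each correlation and its significance.

```python
# A sketch of checking convergent and discriminant predictions.
# All data are simulated; the target correlation matrix merely mimics
# the pattern of relations hypothesized in the amorism example.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
names = ["amorism", "cooperativeness", "subjectivity", "creativity"]
target = np.array([[ 1.00, 0.45, 0.40, -0.35],
                   [ 0.45, 1.00, 0.30,  0.05],
                   [ 0.40, 0.30, 1.00,  0.30],
                   [-0.35, 0.05, 0.30,  1.00]])
scores = rng.multivariate_normal(np.zeros(4), target, size=200)

# Compare each observed r (and its p value) with the hypothesized pattern.
for i in range(4):
    for j in range(i + 1, 4):
        r, p = pearsonr(scores[:, i], scores[:, j])
        print(f"r({names[i]}, {names[j]}) = {r:+.2f}  (p = {p:.3f})")
```

Construct validation then consists in asking whether the signs and magnitudes of the six observed coefficients match the predictions and the "check" hypotheses.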
Another kind of example: an investigator deliberately introduces a measure that would,
if it correlated with the variable whose validity is under study, invalidate other positive
relations. One bugaboo of personality and attitude scales is the social desirability
phenomenon, mentioned earlier. The correlation between the target variable and a
theoretically related variable may arise because the instruments measuring both variables
are tapping social desirability rather than the variables they were designed to tap. One can
partly check whether this is so by including a measure of social desirability along with the
other measures.
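One way to carry out this check, sketched below in Python with invented data, is to compute the partial correlation between the two scales after the social desirability measure has been statistically removed from both; if the zero-order correlation shrinks toward zero, social desirability was doing the work. The residualization shown is a standard equivalent of the partial correlation formula, not a procedure prescribed by the text.

```python
# Partial correlation via residualization; all data are simulated.
import numpy as np

def partial_r(x, y, control):
    """r between x and y after linearly removing `control` from both."""
    def residual(v):
        coeffs = np.polyfit(control, v, 1)      # regress v on control
        return v - np.polyval(coeffs, control)  # keep what control misses
    return np.corrcoef(residual(x), residual(y))[0, 1]

rng = np.random.default_rng(1)
n = 150
social_des = rng.normal(size=n)                  # social desirability scores
target = 0.7 * social_des + rng.normal(size=n)   # target scale, contaminated
related = 0.7 * social_des + rng.normal(size=n)  # "related" scale, same contamination

print("zero-order r:", round(np.corrcoef(target, related)[0, 1], 2))
print("partial r   :", round(partial_r(target, related, social_des), 2))
```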
Despite all the evidence leading the investigator to believe that the A scale has con-
struct validity, he may still be doubtful. So he sets up a study in which he has pupils high
and low in amorism solve problems, predicting that pupils low in amorism will solve
problems more successfully than those high in amorism. If the data support the prediction,
this is further evidence of the construct validity of the amorism measure. It is of course a
significant finding in and of itself. Such a procedure, however, is probably more appropri-
ate with achievement and attitude measures. One can manipulate communications, for
example, in order to change attitudes. If attitude scores change according to theoretical
prediction, this would be evidence of the construct validity of the attitude measure, since
the scores would probably not change according to prediction if the measure were not
measuring the construct.
10. D. Campbell and D. Fiske, "Convergent and Discriminant Validation by the Multitrait-Multimethod
Matrix," Psychological Bulletin, 56 (1959), 81-105.
11. The data are from one study of a number of studies done to test a structural theory of social attitudes.
F. Kerlinger, Liberalism and Conservatism: The Nature and Structure of Social Attitudes. Hillsdale, N.J.:
Erlbaum, 1984.
12. The samples, the scales, and some of the results are described in: F. Kerlinger, "The Structure and
Content of Social Attitude Referents: A Preliminary Study," Educational and Psychological Measurement,
32 (1972), 613-630. The data reported in Table 27.2 were obtained from a Texas sample, N = 227.
A Measure of Anti-Semitism
In an unusual attempt to validate their measure of anti-Semitism, Glock and Stark used
responses to two incomplete sentences about Jews: "It's a shame that Jews . . ." and "I
can't understand why Jews . . ."15 Coders considered what each subject had written and
characterized the responses as negative, neutral, or positive images of Jews. Each subject,
then, was characterized individually as having one of the three different perceptions of
Jews. When the responses to the Index of Anti-Semitic Beliefs, the measure being vali-
dated, were divided into None, Medium, Medium High, and High Anti-Semitism, the
percentages of negative responses to the two open-ended questions were, respectively: 28,
41, 61, 75. This is good evidence of validity, because the percentage of negative images
rises steadily from the individuals categorized None to those categorized High.
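The tabulation behind such percentages is easy to reproduce. The sketch below (Python, with a handful of invented coded responses; the actual study of course had many more subjects) simply computes, for each index category, the percentage of subjects whose sentence completions were coded negative.

```python
# Percentage of negative images by anti-Semitism category; invented codings.
from collections import Counter

records = [("None", "neutral"), ("None", "negative"), ("None", "positive"),
           ("Medium", "negative"), ("Medium", "neutral"), ("Medium", "negative"),
           ("Medium High", "negative"), ("Medium High", "negative"),
           ("Medium High", "neutral"),
           ("High", "negative"), ("High", "negative"), ("High", "negative")]

totals, negatives = Counter(), Counter()
for category, image in records:
    totals[category] += 1
    if image == "negative":
        negatives[category] += 1

for category in ["None", "Medium", "Medium High", "High"]:
    pct = 100 * negatives[category] / totals[category]
    print(f"{category:12s} {pct:5.1f}% negative")
```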
The Consensual Assessment of Creativity
Expressing dissatisfaction with most attempts to define and measure creativity, Amabile
proposed a consensual definition that focused on the judgment of products: "A product or
response is creative to the extent that appropriate observers independently agree it is
creative. Appropriate observers are those familiar with the domain in which the product
was created or the response articulated."16 The actual measurement method Amabile used
was to ask judges to assess the creativity of products produced by individuals using their
own criteria of what is creative. The judges should have had experience with the products
being judged. For example, to apply the method in the assessment of artistic creativity,
Amabile had professional artists and art teachers judge the creativity of collages made by
children. (A collage is an artistic composition of materials pasted on a surface of some
kind.) Let us look at one of her studies, an attempt at construct validation of the method.
Twenty-two young girls, age 7-11 years, were asked to make designs using materials
supplied by the researcher: pieces of paper in different sizes and shapes, white cardboard,
and glue. Each child received the same materials. The children were told to use the
materials in any way they wished to make a design that was "silly." They worked at this
for 18 minutes. Then the judges (art teachers, artists, and psychologists, used to provide
varying expertise with art) were told how the designs were produced and asked to judge
their creativity using a five-point rating system that produced numerical measures reflect-
ing degrees of creativity. Amabile found substantial to high reliabilities and good agree-
14. J. Loevinger, "Objective Tests as Instruments of Psychological Theory," Psychological Reports, 3
(1957), 635-694 (Monograph Supplement 9). Loevinger argues that construct validity, from a scientific point
of view, is the whole of validity. At the other extreme, Bechtoldt argues that construct validity has no place in
psychology. H. Bechtoldt, "Construct Validity: A Critique," American Psychologist, 14 (1959), 619-629.
15. C. Glock and R. Stark, Christian Beliefs and Anti-Semitism. New York: Harper & Row, 1966, pp.
125-127.
16. T. Amabile, "Social Psychology of Creativity: A Consensual Assessment Technique," Journal of Person-
ality and Social Psychology, 43 (1982), 997-1013.
ment between the groups of judges. The artist judges were also asked to evaluate the 22
designs on a number of dimensions including creativity, technical goodness, and aesthetic
appeal. Certain other measures were also used but they need not concern us.
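Interjudge reliability of this kind can be estimated by treating the judges as the "items" of a test whose "persons" are the products judged. A minimal sketch, with simulated ratings rather than Amabile's data, using Cronbach's coefficient alpha computed over judges:

```python
# Interjudge reliability via coefficient alpha; ratings are simulated.
import numpy as np

def alpha_over_judges(ratings):
    """Cronbach's alpha with judges as items: rows = products, cols = judges."""
    k = ratings.shape[1]
    judge_vars = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - judge_vars / total_var)

rng = np.random.default_rng(2)
true_creativity = rng.normal(size=22)        # 22 collages
ratings = true_creativity[:, None] + rng.normal(scale=0.5, size=(22, 8))
print("interjudge reliability:", round(alpha_over_judges(ratings), 2))
```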
One of the strong pieces of evidence Amabile offered for the construct validity of the
consensual assessment method was the result of a factor analysis of the measures pro-
duced by the artist judges.17 Factor analysis is essentially a method of finding how varia-
bles cluster together. Amabile's factor analytic results indicated two independent clusters
of variables, which she named "Creativity" and "Technical Goodness." The creativity
measures were those that have been associated with artistic creativity (novel idea, novel
use, and complexity, for example) and her consensual assessment measure. The Techni-
cal Goodness measures were: technical goodness rated globally, organization, neatness,
symmetry, and so on. Those variables associated with Creativity clustered together and
those variables associated with Technical Goodness clustered together, but the two clus-
ters were separate and different. Had the two kinds of measures appeared together on one
cluster, the validity of the consensual method of assessing creativity would have been in
doubt because creativity was not supposed to be a function of technical adequacy. The
consensual assessment method of measuring creativity evidently passed the construct
validity test.
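The two-cluster result can be illustrated computationally. The sketch below generates ratings from two independent latent dimensions, "creativity" and "technical goodness," and then inspects the loadings of the two largest factors of the correlation matrix, using an unrotated principal-factor solution via eigendecomposition, which is simpler than, though in spirit like, the factor analysis Amabile reported. The variable names are invented to echo hers.

```python
# Do creativity measures and technical measures form separate clusters?
import numpy as np

rng = np.random.default_rng(3)
n = 200
creative = rng.normal(size=n)    # latent "Creativity"
technical = rng.normal(size=n)   # latent "Technical Goodness"

variables = {"consensual creativity": creative, "novel idea": creative,
             "novel use": creative, "complexity": creative,
             "technical goodness": technical, "organization": technical,
             "neatness": technical, "symmetry": technical}
X = np.column_stack([v + rng.normal(scale=0.6, size=n)
                     for v in variables.values()])

R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)                # ascending eigenvalues
loadings = eigvecs[:, -2:] * np.sqrt(eigvals[-2:])  # two largest factors
for name, row in zip(variables, loadings):
    print(f"{name:22s}  factor 1: {row[1]:+.2f}  factor 2: {row[0]:+.2f}")
```

Each variable should load substantially on one factor and near zero on the other; had the two sets loaded together, the validity of the consensual measure would have been in doubt.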
"The use of factor analysis, a method that usually requires large numbers of subjects (judges, in this case),
can be questioned. Since subsequent similar factor analyses produced similar results and since we are here
concerned only with the validation method, we omit criticism of the factor analysis.
18. K. Bollen, "Issues in the Comparative Measurement of Political Democracy," American Sociological
Review, 45 (1980), 370-390.
""Indicator." or "Social Indicator," is an important term in contemporary social research. Unfortunately,
there is little agreement on what indicators are. They have been variously defined as indices of social conditions.
statistics, even variables. In Bollen's paper, they are variables. For a discussion of definitions, see R. Jaeger.
"About Educational Indicators: Statistics on the Conditions and Trends in Education." In L. Shulman. ed..
Review of Research in Education, vol. 6. Itasca, III.: Peacock, 1978, chap. 7.
'"See K. Bollen, "Political Democracy and the Timing of Development." American Sociological Review,
44 (1979), 572-587, especially Appendix, for a detailed description of the Index and its scoring.
38.7; Sweden, 99.9; Soviet Union, 18.2; Israel, 96.8. Bollen has evidently successfully
validated the Index of Political Democracy.
In addition to the multitrait-multimethod approach and the methods used in the above
studies, there are other methods of construct validation. Any tester is familiar with the
technique of correlating items with total scores. In using the technique, the total score is
assumed to be valid. To the extent that an item measures the same thing as the total score
does, to that extent the item is valid.21
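A small sketch of the item-total technique, with simulated item scores. It uses the corrected variant, correlating each item with the total of the remaining items so that the item's own variance does not inflate its coefficient; the tenth item is deliberately generated to be unrelated to the trait, and its low item-total correlation flags it.

```python
# Corrected item-total correlations; item scores are simulated.
import numpy as np

def corrected_item_total(scores):
    """r between each item and the total of the *other* items (rows = persons)."""
    rs = []
    for j in range(scores.shape[1]):
        rest = scores.sum(axis=1) - scores[:, j]
        rs.append(np.corrcoef(scores[:, j], rest)[0, 1])
    return rs

rng = np.random.default_rng(4)
trait = rng.normal(size=120)
items = trait[:, None] + rng.normal(size=(120, 10))
items[:, 9] = rng.normal(size=120)       # item 10: unrelated to the trait
for j, r in enumerate(corrected_item_total(items), start=1):
    print(f"item {j:2d}: r = {r:+.2f}")
```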
In order to study the construct validity of any measure, it is always helpful to correlate
the measure with other measures. The amorism example discussed earlier illustrated
the method and the ideas behind it. But would it not be more valuable to correlate a mea-
sure with a large number of other measures? How better to learn about a construct than to
know its correlates? Factor analysis is a refined method of doing this. It tells us, in ef-
fect, what measures measure the same thing and to what extent they measure what they
measure.
Factor analysis is a powerful and indispensable method of construct validation. We
encountered its use in Amabile's study and pointed out that Bollen used it in his validation
of the Index of Political Democracy. Although it has been briefly characterized earlier and
will be discussed in detail in a later chapter, its great importance in validating measures
warrants characterization here. It is a method for reducing a large number of measures to
a smaller number of underlying measures, called factors.
21. For a discussion of item analysis, see J. Nunnally, Psychometric Theory, 2d ed. New York: McGraw-Hill,
1978, pp. 261ff.
"A. Sorenson, T. Husek, and C. Yu. Divergent Concepts of Teacher Role: An Approach to the Measure-
ment of Teacher Effectiveness," Journal of Educational Psychology, 54 (1963), 287-294.
r_tt = V_∞ / V_t    (27.1)

Val = V_co / V_t    (27.2)

where Val is the validity; V_co the common factor variance; and V_t the total variance of a
measure. Validity is thus seen as the proportion of the total variance of a measure that is
common factor variance. This conception requires explanation. An understanding of
so-called factor theory is required, but factor theory will not be discussed until later in the
book. Despite this difficulty, we must attempt an explanation of validity in variance terms
if we are to have a well-rounded view of the subject. Besides, expressing validity and
reliability mathematically will unify and clarify both subjects.
Indeed, reliability and validity will be seen to be parts of one unified whole.
Common factor variance is the variance of a measure that is shared with other meas-
ures. In other words, common factor variance is the variance that two or more tests have
in common.
In contrast to the common factor variance of a measure is its specific variance, V_sp, the
systematic variance of a measure that is not shared by any other measure. If a test meas-
ures skills that other tests measure, we have common factor variance; if it also measures a
skill that no other test does, we have specific variance.
Figure 27.1 expresses these ideas and also adds the notion of error variance. The A and
B circles represent the variances of Tests A and B. The intersection of A and B, A ∩ B, is
23. The variance treatment of validity presented here is an extension of the treatment of reliability presented in
the last chapter. Both treatments follow J. Guilford, Psychometric Methods, 2d ed. New York: McGraw-Hill,
1954, pp. 354-357.
the relation of the two sets. Similarly, V(A ∩ B) is common factor variance. The specific
variances and the error variances of both tests are also indicated.
From this viewpoint, then, and following the variance reasoning outlined in the last
chapter, any measure's total variance has several components: common factor variance,
specific variance, and error variance. This is expressed by the equation:

V_t = V_co + V_sp + V_e    (27.3)

To be able to talk of proportions of the total variance, we divide the terms of Equation
27.3 by the total variance:

V_t/V_t = V_co/V_t + V_sp/V_t + V_e/V_t    (27.4)
How do Equations 27.1 and 27.2 fit into this picture? The first term on the right,
V_co/V_t, is the right-hand member of (27.2). Therefore validity can be viewed as that part
of the total variance of a measure that is not specific variance and not error variance. This
is easily seen algebraically:
Val = V_t/V_t − V_sp/V_t − V_e/V_t    (27.5)

Recall the definition of reliability from the last chapter:

r_tt = 1 − V_e/V_t    (27.6)

r_tt = V_t/V_t − V_e/V_t    (27.7)

The right-hand side of the equations, however, is part of the right-hand side of (27.5). If
we rewrite (27.5) slightly, we obtain

V_t/V_t = V_co/V_t + V_sp/V_t + V_e/V_t    (27.8)

This must mean, then, that validity and reliability are close variance relations. Reliability
is equal to the first two right-hand members of (27.8). So, bringing in (27.1):

r_tt = V_co/V_t + V_sp/V_t = V_∞/V_t    (27.9)

V_co/V_t = V_∞/V_t − V_sp/V_t    (27.10)
Let us assume that we have a method of determining the common factor variance (or vari-
ances) of a test. (Later we shall see that factor analysis is such a method.) For simplicity
suppose that there are two sources of common factor variance in a test, and no others.
Call these factors A and B. They might be verbal ability and arithmetic ability, or they
might be liberal attitudes and conservative attitudes. If we add the variance of A to the
variance of B, we obtain the common factor variance of the test, which is expressed by the
equations,
V_co = V_a + V_b    (27.11)

V_co/V_t = V_a/V_t + V_b/V_t    (27.12)

Val = V_a/V_t + V_b/V_t    (27.13)
The total variance of a test, we said before, includes the common factor variance, the
variance specific to the test and to no other test (at least as far as present information
goes), and error variance. Equations 27.3 and 27.4 express this. Now, substituting in
(27.4) the equality of (27.12), we obtain
V_t/V_t = V_a/V_t + V_b/V_t + V_sp/V_t + V_e/V_t    (27.14)
The first two terms on the right-hand side of (27.14) are associated with the validity of
the measure, and the first three terms on the right are associated with the reliability of the
measure. These relations have been indicated. Common factor variance, or the validity
component of the measure, is labeled h² (communality), a symbol customarily used to
indicate the common factor variance of a test. Reliability, as usual, is labeled r_tt.
To discuss all the implications of this formulation of validity and reliability would take
us too far astray at this time. All that is needed now is to try to clarify the formulation with
a diagram and a brief discussion.
Figure 27.2 is an attempt to express Equation 27.14 diagrammatically. The figure
represents the contributions of the different variances to the total variance (taken to be
equal to 100 percent). Four variances, three systematic variances and one error variance,
comprise the total variance in this theoretical model.24 The contribution of each of the
sources of variance is indicated. Of the total variance, 80 percent is reliable variance. Of
the reliable variance, 30 percent is contributed by Factor A and 25 percent by Factor B,
and 25 percent is specific to this test. The remaining 20 percent of the total variance is
error variance.
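The arithmetic of the model is worth making explicit. The following lines simply restate Equations 27.13 and 27.14 with the percentages of this example, expressed as proportions of the total variance:

```python
# Variance components of the Figure 27.2 example, as proportions of V_t.
V_a, V_b, V_sp, V_e = 0.30, 0.25, 0.25, 0.20    # terms of Equation 27.14

h2 = V_a + V_b            # communality: valid (common factor) variance, Eq. 27.13
r_tt = V_a + V_b + V_sp   # reliability: all the systematic variance
print(f"validity (h^2)     = {h2:.2f}")    # 0.55
print(f"reliability (r_tt) = {r_tt:.2f}")  # 0.80
print(f"error proportion   = {V_e:.2f}")   # 0.20
```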
The test may be interpreted as quite reliable, since a sizable proportion of the total
variance is reliable or "true" variance. The interpretation of validity is more difficult. If
there were only one factor, say A, and it contributed 55 percent of the total variance, then
we could say that a considerable proportion of the total variance was valid variance. We
would know that a good bit of the reliable measurement would be the measurement of the
construct represented by A.
24. Naturally, practical outcomes never look this neat. It is remarkable, however, how well the model works.
The variance thinking, too, is valuable in conceptualizing and discussing measurement outcomes.
[Figure 27.2: Division of the total variance of a hypothetical test: reliable variance V_∞ (80 percent), of which h² = V_co (55 percent) is common factor (valid) variance; error variance V_e (20 percent).]
A deeper problem, concerned with the nature of "reality" and the nature of the properties
being measured, is heavily philosophical.
Despite the difficulties of achieving reliable and valid psychological, sociological, and
educational measurements, great progress has been made in this century. There is growing
understanding that all measuring instruments must be critically and empirically examined
for their reliability and validity. The day of tolerance of inadequate measurement has
ended. The demands imposed by professionals, the theoretical and statistical tools avail-
25. Note that even if we thought the test was measuring only A, predictions to a criterion might well be
successful, especially if the criterion had a lot of both A and B in it. The test could have predictive validity even
though its construct validity was questionable.
able and rapidly being developed, and the increasing sophistication of graduate students of
psychology, sociology, and education have set new high standards that should be healthy
stimulants both to the imaginations of research workers and to developers of scientific
measurement.
Study Suggestions
1. The measurement literature is vast. The following references have been chosen for their
particular excellence or their relevance to important measurement topics. Some of the discussions,
however, are technical and difficult. The student will find elementary discussions of reliability and
validity in most measurement texts.
Cronbach and Meehl, construct validity article. (See footnote 1 and Mehrens and Ebel,
below.) A most important contribution to modern measurement and behavioral research.
Cureton, E. "Measurement Theory." In R. Ebel, V. Noll, and R. Bauer, eds., Encyclopedia
of Educational Research, 4th ed. New York: Macmillan, 1969, pp. 785-804. A broad and
firm overview of measurement, with an emphasis on educational measurement.
Guilford and Nunnally texts. (Footnotes 21 and 23.) Excellent advanced texts.
Standards for Educational and Psychological Tests. Washington, D.C.: American Psychologi-
cal Association, 1974. A definitive statement jointly produced by three large associations
concerned with measurement.
Thorndike, R., ed. Educational Measurement, 2d ed. Washington, D.C.: American Council
on Education, 1971. An outstanding achievement that follows a distinguished predecessor,
the 1951 first edition. Authoritative chapters on all aspects of meas-
urement, including reliability and validity. The reliability chapters in both editions, by
Thorndike (1951) and Stanley (1971), have exceptionally good tables (original table by
Thorndike) summarizing the possible sources of variance in measures: Table 8, p. 568,
1951 edition; Table 13.1, p. 364, 1971 edition.
Tryon, R. "Reliability and Behavior Domain Validity: A Reformulation and Historical Cri-
tique." Psychological Bulletin, 54 (1957), 229-249. This is an excellent and important
article on reliability. It contains a good worked example.
The following anthologies of measurement articles are valuable sources of the classics in the
field. This is especially true of the Mehrens and Ebel and the Jackson and Messick volumes.
Anastasi, A., ed. Testing Problems in Perspective. Washington, D.C.: American Council on
Education, 1966.
Chase, C., and Ludlow, G., eds. Readings in Educational and Psychological Measurement.
Boston: Houghton Mifflin, 1966.
Jackson, D., and Messick, S., eds. Problems in Human Assessment. New York: McGraw-
Hill, 1967.
Mehrens, W., and Ebel, R., eds. Principles of Educational and Psychological Measurement.
Skokie, Ill.: Rand McNally, 1967.
response sets on measurement instruments are quite strong in their statements. A considerable dash
of salt has been thrown on the response-set tail by L. Rorer: "The Great Response-Style Myth,"
Psychological Bulletin, 63 (1965), 129-156.
The position taken in this book is that response sets certainly operate and sometimes have
considerable effect, but that the strong claims of advocates are exaggerated. Most of the variance in
well-constructed measures seems to be due to the variables being measured and relatively little to
response set. Investigators must be aware of response sets and their possible deleterious effects on
measurement instruments, but they should not be afraid to use the instruments. If one were to take
too seriously the schools of thought on response set and on what has been called the experimenter
effect (in education, the Pygmalion effect), discussed earlier, one would have to abandon behavioral
research, except, perhaps, research that can be done with so-called unobtrusive measures.
4. Discuss and criticize the following statements:
(a) "The reliability of my creativity test is .85. 1 can therefore be reasonably sure that I am
measuring creativity."
(b) "My creativity test really measures creativity, because I had an expert on creativity
carefully screen all the items of the test."
(c) "Since the reliability of the test of logical thinking is only .40. its validity is negligible.
5. Study the following assertions and decide in each case whether the assertion refers to relia-
bility or validity, or both. Label the type of reliability and validity.
(a) "The test was given twice to the same group. The coefficient of correlation between the
scores of the two administrations was .90."
(b) "Four teachers studied the items of a test for their relevance to the objectives of the
curriculum."
(c) "The items seem to be a good sample of the item universe."
(d) "Between a test of academic aptitude and grade-point averages, r = .55."
(e) "The mean difference between Republicans and Democrats on the conservatism instru-
ment was highly significant."
6. Imagine that you have given a test of six items to six persons. The scores of each person on
each item are given below. Say that you have also given another test of six items to six persons.
These scores are also given below. The scores of the first test, I, are given on the left; the scores of
the second test, II, are given on the right.
dures and the reasoning behind them? Would the effect of changing the orders of, say,
five to ten items have affected the r„'s as much as in these examples? If not, why not?
[Answers: (a) I: F_items = 3.79 (.05), F_persons = 20.44 (.001); II: F_items = 1.03 (n.s.),
F_persons = 1.91 (n.s.). (b) I: r_tt = .95; II: r_tt = .48.]
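The F ratios and r_tt values in the answer suggest the analysis of variance approach to reliability discussed in the last chapter, in which r_tt = (MS_persons − MS_residual)/MS_persons. Since the score tables themselves are not reproduced here, the sketch below applies the formula to an invented 6 × 6 persons-by-items matrix; substituting the book's data would reproduce the answers given.

```python
# Reliability from a two-way analysis of variance (persons x items).
import numpy as np

def anova_reliability(scores):
    """r_tt = (MS_persons - MS_residual) / MS_persons."""
    n, k = scores.shape
    grand = scores.mean()
    ss_persons = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_items = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_residual = ((scores - grand) ** 2).sum() - ss_persons - ss_items
    ms_persons = ss_persons / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return (ms_persons - ms_residual) / ms_persons

scores = np.array([[1, 1, 1, 1, 0, 0],     # hypothetical data, not the book's
                   [1, 1, 1, 0, 0, 0],
                   [1, 1, 1, 1, 1, 0],
                   [1, 0, 0, 0, 0, 0],
                   [1, 1, 0, 0, 0, 0],
                   [1, 1, 1, 1, 1, 1]])
print("r_tt =", round(anova_reliability(scores), 2))
```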
7. An important development of the last decade is criterion-referenced measurement, a large
and controversial subject. Since its basic use is in applied educational measurement, and since this
book's emphasis is on scientific research in the behavioral sciences generally, it has not been
discussed. The following references will be helpful to the student of education:
Thorndike and Hagen text (footnote 4), chap. 6 and pp. 658-661. An elementary discussion.
METHODS OF OBSERVATION AND DATA COLLECTION
Introduction
The following chapters, then, have three main purposes. The first is to acquaint
the student with the most important observational methods that are available. Gradu-
ate students seem to concentrate on two or three methods, perhaps because of lack of
familiarity with available methods. This restriction to two or three methods unduly
narrows the range of possible problems and inquiry. Thus one of the prime objectives
of these chapters is to broaden the student's knowledge of available methods.
The second purpose is to help the student understand the main characteristics and
purposes of the methods. Methods differ considerably in what they can and cannot
do. Users of methods must know these possibilities if they are to be able to choose
methods suited to their problems. Many a good problem has suffered from an inap-
propriate and inadequate method.
The third purpose is closely related to the second: to indicate, if incompletely, the
strengths and weaknesses of the methods. One method may be well suited to a prob-
lem, but it may have grave weaknesses that disqualify it. The mail questionnaire is a
case in point. A problem may require a wide geographical sampling of schools,
which can be easily accomplished by the mail questionnaire. But its well-known
weakness would perhaps disqualify it from consideration, unless it were the only pos-
sible way to obtain data.
Chapter 28
Interviews and
Interview Schedules
The interview is perhaps the most ubiquitous method of obtaining information from
people. It has been and is still used in all kinds of practical situations: the lawyer obtains
information from a client; the physician learns about a patient; the admissions officer or
professor determines the suitability of students for schools, departments, and curricula.
Only recently, however, has the interview been used systematically for scientific pur-
poses, both in the laboratory and in the field.
Data-collection methods can be categorized by the degree of their directness. If we
wish to know something about people, we can ask them about it directly. They may or
may not give us an answer. On the other hand, we may not ask a direct question. We may
use an ambiguous stimulus, like a blurred picture, a blot of ink, or a vague question; and
then ask for impressions of the stimulus on the assumption that the respondents will give
the needed information without knowing they are giving it. This method is highly indirect.
Most of the data-collection methods used in psychological and sociological research are
relatively direct or moderately indirect. Rarely are highly indirect means used.
Interviews and schedules (questionnaires) are ordinarily quite direct. This is both a
strength and a weakness. It is a strength because a great deal of the information needed in
social scientific research can be gotten from respondents by direct questions. Though the
questions may have to be carefully handled, respondents can, and usually will, give much
information directly. There is information, however, of a more difficult nature that re-
spondents may be unwilling, reluctant, or unable to give readily and directly, for exam-
ple, information on income, sexual relations, and attitudes toward religion and minority
groups. In such cases, direct questions may yield data that are invalid. Yet, properly
handled, even personal or controversial material can be successfully obtained with inter-
views and schedules.
The interview is probably man's oldest and most often used device for obtaining
information. It has important qualities that objective tests and scales and behavioral obser-
vations do not possess. When used with a well-conceived schedule, an interview can
obtain a great deal of information, is flexible and adaptable to individual situations, and
can often be used when no other method is possible or adequate. These qualities make it
especially suitable for research with children.1 An interviewer can know whether the
respondent, especially a child, does not understand a question and can, within limits,
repeat or rephrase the question. Questions about hopes, aspirations, and anxieties can be
asked in such a way as to elicit accurate information. Most important, perhaps, the
interview permits probing into the context and reasons for answers to questions.
The major shortcoming of the interview and its accompanying schedule is practical.
Interviews take a lot of time. Getting information from one individual may take as long as
an hour or even two hours. This large time investment costs effort and money. So,
whenever a more economical method answers the research purposes, interviews should
not be used.
1. L. Yarrow, "Interviewing Children." In P. Mussen, ed., Handbook of Research Methods in Child Devel-
opment. New York: Wiley, 1960, chap. 14.
2. The student will find detailed guidance in: C. Cannell and R. Kahn, "Interviewing." In G. Lindzey and
E. Aronson, eds., The Handbook of Social Psychology, 2d ed. Reading, Mass.: Addison-Wesley, 1968, vol. II,
chap. 15. See, also, D. Warwick and C. Lininger, The Sample Survey: Theory and Practice. New York:
McGraw-Hill, 1975, chap. 7.
The Interview
The interview is a face-to-face interpersonal role situation in which one person, the inter-
viewer, asks a person being interviewed, the respondent, questions designed to obtain
answers pertinent to the research problem. There are two broad types of interview: struc-
tured and unstructured, or standardized and unstandardized.3 In the standardized inter-
view, the questions, their sequence, and their wording are fixed. An interviewer may be
allowed some liberty in asking questions, but relatively little.4 This liberty is specified in
advance. Standardized interviews use interview schedules that have been carefully pre-
pared to obtain information pertinent to the research problem.
Unstandardized interviews are more flexible and open. Although the research pur-
poses govern the questions asked, their content, their sequence, and their wording are in
the hands of the interviewer. Ordinarily no schedule is used. In other words, the unstan-
dardized, nonstructured interview is an open situation in contrast to the standardized,
structured interview, which is a closed situation. This does not mean that an unstan-
dardized interview is casual. It should be just as carefully planned as the standardized one.
Our concern here is mainly with the standardized interview. It is recognized, however,
that many research problems may, and often do, require a compromise type of interview
in which the interviewer is permitted to use alternate questions that he judges fit for
particular respondents and particular questions.5
ble prior study and practice. There are several reasons for this, the main ones probably
being the multiple meaning and ambiguity of words, the lack of sharp and constant focus
on the problems and hypotheses being studied, a lack of appreciation of the schedule as a
measurement instrument, and a lack of necessary background and experience.
Three kinds of information are included in most schedules: face sheet (identification)
information, sociological background information, and the information of the research
problem itself. Two broad types of question are used: fixed-alternative (or closed) and
open-end (or open). A third type of item, having fixed alternatives, is also used: scale
items.
Fixed-Alternative Items
Fixed-alternative items, as the name indicates, offer the respondent a choice among two or
more alternatives. These items are also called closed or poll questions. The commonest
kind of fixed-alternative item is dichotomous: it asks for Yes-No, Agree-Disagree, and
other two-alternative answers. Often a third alternative, Don't Know or Undecided, is
added.
Two examples of fixed-alternative items follow:6
There are always some people whose ideas are considered bad or dangerous by other people, for
instance, somebody who is against all churches and religion. If such a person wanted to make a
speech in your city (town, community) against churches and religion, should he be allowed to
speak, or not?
Yes □
No □
Don't know □
If the school board in your community were to say, some day, that there were no Communists
teaching in your schools, would you feel pretty sure it was true, or not?
Would feel it was true □
Would not □
Don't know □
Although fixed-alternative items have the decided advantages of achieving greater
uniformity of measurement and thus greater reliability, of forcing the respondent to an-
swer in a way that fits the response categories previously set up, and of being easily
coded, they have certain disadvantages. The major disadvantage is their superficiality:
Without probes they do not ordinarily get beneath the response surface. They may also
irritate a respondent who finds none of the alternatives suitable. Worse, they can force
responses. A respondent may choose an alternative to conceal ignorance. Or he may
choose alternatives that do not accurately represent true facts or opinions. These difficul-
ties do not mean that fixed-alternative items are bad and useless. On the contrary, they can
be used to good purpose if they are judiciously written, used with probes, and mixed with
open items.7
Open-End Items
Open or open-end items are an extremely important development in the technique of
interviewing. Open-end questions are those that supply a frame of reference for respon-
dents' answers, but put a minimum of restraint on the answers and their expression. While
their content is dictated by the research problem, they impose no other restrictions on the
content and manner of respondent answers. Examples will be given later.
Open-end questions have important advantages, but they have disadvantages, too. If
properly written and used, however, these disadvantages can be minimized. Open-end
6. From Communism, Conformity, and Civil Liberties by Samuel A. Stouffer. Garden City, N.Y.: Double-
day, 1955, pp. 252 and 256. Copyright © 1955 by Samuel A. Stouffer. Reprinted by permission of Doubleday
& Company, Inc.
7. See Warwick and Lininger, op. cit., pp. 210-215. A probe is a device used to find out respondents'
information on a subject, their frames of reference, or, more usually, to clarify and ascertain reasons for
responses given. Probing increases the "response-getting" power of questions without changing their content.
Examples of probes are: "Tell me more about that." "How is that?" "Can you explain that?"
questions are flexible; they have possibilities of depth; they enable the interviewer to clear
up misunderstanding (through probing), to ascertain a respondent's lack of knowledge, to
detect ambiguity, to encourage cooperation and achieve rapport, and to make better esti-
mates of respondents' true intentions, beliefs, and attitudes. Their use also has another
advantage: the responses to open-end questions can suggest possibilities of relations and
hypotheses. Respondents will sometimes give unexpected answers that may indicate the
existence of relations not originally anticipated.
A special type of open-end question is the funnel. Actually, this is a set of questions
directed toward getting information on a single important topic or a single set of related
topics. The funnel starts with a broad question and narrows down progressively to the
important specific point or points.8 Warwick and Lininger point out that the merits of the
funnel are that it allows free response in the earlier questions, narrows down to specific
questions and responses, and also facilitates the discovery of respondents' frames of
reference.9 Another form of funnel starts with an open general question and follows this
up with specific closed items. The best way to get a feeling for good open-end questions
and funnels is to study examples.
To obtain information on child-rearing practices, Sears, Maccoby, and Levin used a
number of good open-end and funnel questions. One of them, with the authors' comments
in brackets, is:
All babies cry, of course. [Note that the interviewer puts the parent at ease about her child's
crying.] Some mothers feel that if you pick up a baby every time it cries, you will spoil it.
Others think you should never let a baby cry for very long. [The frame of reference has been
clearly given. The mother is also put at ease no matter how she handles her baby's crying.]
How do you feel about this?
(a) What did you do about this with X?
(b) How about in the middle of the night?10
This funnel question set not only reaches attitudes; it also probes specific practices.
Scale Items
A third type of schedule item is the scale item. A scale is a set of verbal items to each of
which an individual responds by expressing degrees of agreement or disagreement or
some other mode of response. Scale items have fixed alternatives and place the respond-
ing individual at some point on the scale. (They will be discussed at greater length in
Chapter 29.) The use of scale items in interview schedules is a development of great
promise, since the benefits of scales are combined with those of interviews. We can
include, for example, a scale to measure attitudes toward education in an interview sched-
ule on the same topic. Scale scores can be obtained in this way for each respondent and
can be checked against open-end question data. One can measure the tolerance of noncon-
formity, as Stouffer did, by having a scale to measure this variable embedded in the
interview schedule."
In the last decade, survey researchers have increasingly used scale items. They are
used heavily, for example, in the large and important survey on the quality of American
life done by the Survey Research Center of the University of Michigan.12 Here is an
"Stouffer, op. cii.. App. C.
12. A. Campbell, P. Converse, and W. Rodgers, The Quality of American Life. New York: Russell Sage
Foundation, 1976.
interesting example of an item that combines both open and closed approaches, the latter
with a scale item. Note the branching depending on how the respondent answers the
"Some people ..." item. Note, too, the crucial item (C6) on satisfaction with life in
the United States.
C5. Some people say there isn't as much freedom in this country as there ought to be. How
about you — how free do you feel to live the kind of life you want — very free, free enough,
not very free, or not free at all?
1. VERY FREE   2. FREE ENOUGH   3. NOT VERY FREE   4. NOT FREE AT ALL
   GO TO C6       GO TO C6
C6. (HAND R CARD 3, WHITE) All things considered, how satisfied are you with life in the
United States today? Which number comes closest to how satisfied or dissatisfied you
feel?13
Brief comments are appended to the questions. When confronted with the actual necessity
of drafting a schedule, the student should consult more extended treatments, since the
ensuing discussion, in keeping with the discussion of the rest of the chapter, is intended
only as an introduction to the subject.14
1. Is the question related to the research problem and the research objectives? Except
for factual and sociological information questions, all the items of a schedule should have
some research problem function. This means that the purpose of each question is to elicit
information that can be used to test the hypotheses of the research.
2. Is the type of question appropriate? Some information can best be obtained with
the open-end question: reasons for behavior, intentions, and attitudes. Certain other
information, on the other hand, can be more expeditiously obtained with closed questions.
If all that is required of a respondent is a preferred choice between two or more alternatives,
and these alternatives can be clearly specified, it would be wasteful to use an open-end
question.15
3. Is the item clear and unambiguous? An ambiguous statement or item is one that
permits or invites alternative interpretations and differing responses resulting from the
alternative interpretations. So-called double-barreled questions are ambiguous, for exam-
"Ibid.. p. 527. In item C6, 1 = highly satisfied, 7 = highly dissatisfied, and 4 = neutral.
'"For practical guidance, see Warwick and Lininger, op. cit.. chap. 6; E. Noelle-Neuman, "Wanted: Rules
for Wording Structured Questionnaires," Public Opinion Quarterly. 35 (1970), 191-201.
"See H. Schuman and S. Presser, "The Open and Closed Question," American Sociological Review. 44
(1979), 692-712; Dohrenwend and Richardson, op. cit.: Warwick and Lininger, op. cit., pp. 132-140.
pie, because they provide two or more frames of reference rather than only one. Cannell
and Kahn give a "good" example of a double-barreled and thus ambiguous question:
"How do you feel about the development of a rapid transit system between the central city
and the suburbs, and the redevelopment of central city residential areas?"16 Respon-
dents, even if not baffled by the complexity and alternatives offered by this question, can
hardly respond using a common frame of reference and understanding of what is wanted.
But ambiguity can arise from much simpler questions: "How are you and your family
getting along this year?" Does the questioner mean finances, marital happiness, health,
status, or what?
A great deal of work has been done on item writing. Certain precepts, if followed,
help the item writer avoid ambiguity. First, questions that contain more than one idea to
which a respondent can react should be avoided. An item like "Do you believe that the
educational aims of the modern high school and the teaching methods used to attain these
aims are educationally sound?" is an ambiguous question, because the respondent is
asked about both educational aims and teaching methods in the same question. Second,
avoid ambiguous words and expressions. A respondent might be asked the question, "Do
you think the teachers of your school get fair treatment?" This is an ambiguous item
because "fair treatment" might refer to several different areas of treatment. The word
"fair." too, can mean "just," "equitable," "not too good," "impartial," and "objec-
tive," The question needs a clear context, an explicit frame of reference. (Sometimes,
however, ambiguous questions are deliberately used to elicit different frames of refer-
ence.)
4. Is the question a leading question? Leading questions suggest answers. As such,
they threaten validity. If you ask a person "Have you read about the local school situa-
tion?" you may get a disproportionately large number of "yes" responses, because the
question may imply that it is bad not to have read about the local school situation.
5. Does the question demand knowledge and information that the respondent does not
have?
6. Is the question loaded with social desirability? Certain responses
are socially desirable, responses that indicate or imply approval of actions or things that
are generally considered to be good. We may ask a person about his feelings toward
children. Everybody is supposed to love children. Unless we are careful, we will get a
stereotyped response about children and love. Also, when we ask if a person votes, we
must be careful since everyone is supposed to vote. If we ask respondents their reactions
to minority groups, we again run the risk of getting invalid responses.
Most educated
people, no matter what their "true" attitudes, are aware that prejudice is disapproved. A
good question, then, is one in which respondents are not led to express merely socially
desirable sentiments. At the same time, one should not question respondents in such a way
that they are led to conceal their true feelings.
Second, the same question frequently has different meanings for different people. As we saw, this can
be handled in the interview. But we are powerless to do anything about it when the
instrument is self-administered. Third, if only closed items are used, the instrument dis-
plays the same weaknesses of closed items discussed earlier. On the other hand, if open
items are used, the respondent may object to writing the answers, which reduces the
sample of adequate responses. Many people cannot express themselves adequately in
writing, and many who can express themselves dislike doing so.
Because of these disadvantages, the interview is probably superior to the self-adminis-
tered questionnaire. (This statement does not include carefully constructed personality and
attitude scales.) The best instrument available for sounding people's behavior, future
intentions, feelings, attitudes, and reason for behavior would seem to be the structured
interview coupled with an interview schedule that includes open-end, closed, and scale
items. Of course, the structured interview must be carefully constructed and pretested and
be used only by skilled interviewers. The cost in time, energy, and money, and the very
high degree of skill necessary for its construction, are its main drawbacks. Once these
disadvantages are surmounted, the structured interview is a powerful tool.
ADDENDUM
Examples of the Interview as a Research Tool
This chapter has emphasized the use of the interview in survey research. Its value as a
primary or supplementary method in all kinds of studies must be pointed out. When
information is difficult to get with other methods and when it is necessary to probe or go
deep, the interview can be invaluable. When a new area is being explored, interviewing
may be useful to obtain leads on hypotheses, variables, and items. When research is done
with small children, interviewing may be the only way to have them communicate. Rather
than labor these points, we give examples to supplement those given in the chapter.
In her impressive study of attitudes toward rape mentioned in an earlier chapter, Burt
used interviews whose questions were mostly closed and, most important, scales to mea-
sure several variables considered to be important in supporting acceptance of rape: sex
role stereotypy, adversarial sexual beliefs, and sexual conservatism, for example.17 The
open-ended questions focused on the subjects' experiences with sexual assault, for exam-
ple: "Have you ever had anyone force sex on you against your will?"" Actually, Burt did
not use open-ended questions in the usual sense since all her questions required only brief
responses with no elaboration. This use of the interview well illustrates what will probably
be an increasing trend that we noticed before: generous use of scales in interviews to
measure the variables under study. In earlier years the use of scales in interviews was
relatively rare. Social scientific researchers today want the advantages of the open-ended
question, the closed question, and the scale. Burt's almost exclusive use of closed ques-
tions and scales illustrates recent increased preoccupation with the validity and reliability
of variable measurement. I think we can also anticipate social scientific researchers'
increasing use of scales that are administered to random samples by interviewers working
on entirely different research projects. For example, a social attitude scale used in a
cross-cultural attitude study was administered to a random sample of The Netherlands by
a commercial survey research organization in one of its weekly surveys.18
In their study of children's perceptions of social stratification, Simmons and Rosen-
berg used three-hour interviews as their primary method of determining third- to twelfth-
grade children's awareness of stratification and their views of the American opportunity
structure." One question aimed at the latter was: "Do all kids in America have the same
chance to grow up and get the good things in life, or Do some kids not have as good a
chance as others, or Don't you know?"
Study Suggestions
1. There are valuable references on the interview and the interview schedule. A few are given
below. Those marked with an asterisk will probably be of most help to the reader whose knowledge
of the field is limited.
*Cannell and Kahn; *Warwick and Lininger: see footnote 2, above. *Interviewer's Manual,
Survey Research Center, rev. ed. Ann Arbor: Institute for Social Research, University of
Michigan, 1976. An excellent guide to the practical aspects of interviewing.
Kahn, R., and Cannell, C. The Dynamics of Interviewing. New York: Wiley, 1957. An
authoritative book based on the University of Michigan Survey Research Center (see
above) theory and practice.
Richardson, S., Dohrenwend, B., and Klein, D. Interviewing. New York: Basic Books,
1965.
17. M. Burt, "Cultural Myths and Support for Rape," Journal of Personality and Social Psychology, 38
(1980), 217-230.
18. F. Kerlinger, C. Middendorp, and J. Amon, "The Structure of Social Attitudes in Three Countries: Tests
of a Criterial Referents Theory," International Journal of Psychology, 11 (1976), 265-279.
"R. Simmons and M. Rosenberg. "Functions of Children's Perceptions of the Stratification System,"
American Sociological Review. 36 (1971), 235-249.
Campbell, A., Converse, P., and Rodgers, W. The Quality of American Life. New York:
Russell Sage Foundation, 1976, app. B. A long schedule with many scale items and
careful interviewer instructions. Also a substantively important study.
Free, L., and Cantril, H. The Political Beliefs of Americans. New Brunswick, N.J.: Rutgers
University Press, 1967, app. B. Has good questions and probes and fixed-alternative
items.
Glock, C., and Stark, R. Christian Beliefs and Anti-Semitism. New York: Harper & Row,
1966. The complete schedule, mostly with fixed-alternative items, is given at the end of
the book.
Lortie, D. Schoolteachers: A Sociological Study. Chicago: University of Chicago Press, 1975,
app. B. Particularly valuable to educational researchers. Includes probes and suggestions
to interviewers.
Stouffer, S. Communism, Conformity, and Civil Liberties. Garden City, N.Y.: Doubleday,
1955, app. B. This schedule contains many fixed-alternative items and scales especially
designed to measure tolerance and perception of Communist threat. See app. C for the
scales.
Verba, S., and Nie, N. Participation in America: Political Democracy and Social Equality.
New York: Harper & Row, 1972. A valuable and sophisticated study using survey data of
2,549 interviews conducted by the National Opinion Research Center in 1967.
3. The examples given in 2, above, are all survey research, the field of research in which the art
and technique of interviewing were developed and used. Interviews, however, can be and have been
used in what can be called "normal" studies, studies whose only or main interest is in pursuing
relations among variables. The Burt study of attitudes toward rape, discussed earlier, is a good
example. Here are four others.
Campbell, A., and Schuman, H. Racial Attitudes in Fifteen Cities. Ann Arbor: Institute for
Social Research, University of Michigan, 1968. A combination of "survey"
and attitudinal questions aimed at understanding racial attitudes and their change.
Doob, A., and Macdonald, G. "Television Viewing and Fear of Victimization: Is the Rela-
tionship Causal?" Journal of Personality and Social Psychology, 37 (1979), 170-179.
Jackson, P., Silberman, M., and Wolfson, B. "Signs of Personal Involvement in Teachers'
Descriptions of Their Students," Journal of Educational Psychology, 60 (1969), 22-27.
Naturalistic brief interviews whose results were analyzed to obtain evidence of teachers'
personal involvement with pupils.
Quinn, R., Kahn, R., Tabor, J., and Gordon, L. The Chosen Few: A Study of Discrimina-
tion in Executive Selection. Ann Arbor: Institute for Social Research, University of Michi-
gan, 1968. Confidential interviews with executives were used in a study of anti-Semitism.
gan, 1968. Confidential interviews with executives were used in a study of anti-Semitism.
Chapter 29
Objective Tests
and Scales
The most-used method of observation and data collection in the behavioral sciences is
the test or scale. The considerable time researchers spend in constructing or finding
measures of variables is well spent because adequate measurement of research variables is
at the core of behavioral scientific work. In general, too little attention has been paid to the
measurement of the variables of research studies. What good are intriguing and important
research problems and sophisticated research design and statistical analysis if the variables
of research studies are poorly measured? Fortunately, great progress has been made in
understanding psychological and educational measurement theory and in improving meas-
urement practice. In this chapter we examine some of the technology of objective meas-
urement procedures.
1. For an extended discussion of objectivity, see: F. Kerlinger, Behavioral Research: A Conceptual Ap-
proach. New York: Holt, Rinehart and Winston, 1979, pp. 9-13 and 262-264. The importance of understand-
ing objectivity in science cannot be overemphasized. It is especially important to understand that scientific
objectivity is methodological and has little or nothing to do with objectivity as a presumed characteristic of
scientists. Whether a scientist as a person is or is not objective is not the point; the point is that scientific
objectivity inheres in methodological procedures characterized by agreement among expert judges — and nothing
more.
sets of objects as anyone else. An objective procedure is one in which agreement among
observers is at a maximum. In variance terms, observer variance is at a minimum. This
means that judgmental variance, the variance due to differences in judges' assignment of
numerals to objects, approaches zero.
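The simplest index of this kind of agreement is the proportion of objects to which two observers assign the same category. A minimal sketch with invented codings follows; chance-corrected indices such as Cohen's kappa refine the same idea.

```python
# Percentage agreement between two observers; codings are invented.
obs1 = ["neg", "neg", "neu", "pos", "neg", "neu", "pos", "pos", "neg", "neu"]
obs2 = ["neg", "neu", "neu", "pos", "neg", "neu", "pos", "neg", "neg", "neu"]

agreements = sum(a == b for a, b in zip(obs1, obs2))
print(f"agreement: {agreements}/{len(obs1)} = {100 * agreements / len(obs1):.0f}%")
```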
All methods of observation are inferential: inferences about properties of the members
of sets are made on the basis of the numerals assigned to the set members with interviews,
tests, scales, and direct observations of behavior. The methods differ in their directness or
indirectness, in the degree to which inferences are made from the raw observations. The
inferences made by using objective methods of observation are usually lengthy, despite
their seeming directness. Most such methods permit a high degree of inter-observer agree-
ment because subjects make marks on paper, the marks being restricted to two or more
choices among alternatives supplied by the observer. From these marks on paper the
observer infers the characteristics of the individuals and sets of individuals making the
marks. In one class of objective methods, the marks on paper are made by the observer (or
judge) who looks at the object or objects of measurement and chooses between given
alternatives. In this case, too, inferences about the properties of the observed object or
objects are made from the marks on paper. The main difference lies in who makes the
marks.
It should be recognized that all methods of observation have some objectivity. There is
not a sharp dichotomy, in other words, between so-called objective methods and other
methods of observation. There is, rather, a difference in the degree of objectivity. Again,
if we think of degrees of objectivity as degrees of agreement among observers, the ambiguity
of the dichotomy disappears.
Achievement Tests
teachers to measure more limited and specific achievements. They may, of course, also be
constructed by educational researchers for measuring limited areas of achievement or
proficiency.
2. O. Buros, ed., The Eighth Mental Measurements Yearbook. Highland Park, N.J.: Gryphon Press, 1978.
See, also, Sax's highly useful presentation on published sources of information about tests: G. Sax, Principles of
Educational and Psychological Measurement and Evaluation, 2d ed. Belmont, Calif.: Wadsworth, 1980, pp.
339-355.
Standardized achievement tests can also be classified into general and special tests.
General tests are typically batteries of tests that measure the most important areas of
school achievement: language usage, vocabulary, reading, arithmetic, and social studies.
Special achievement tests, as the name indicates, are tests in individual subjects, such as
history, science, and English.
Researchers will often have no choice of achievement tests because school systems
have already selected them. Given choice, however, they must carefully assess the kind of
achievement research problems require. Suppose the research variable in a study is
achievement in understanding concepts. Many, perhaps most, tests used in schools will
not be adequate for measuring this variable. In such cases, researchers can choose a test
specifically designed to measure the understanding of concepts, or can devise such tests
themselves. The construction of an achievement test is a formidable job, the details of
which cannot be discussed here. The student is referred to specialized texts.³
Personality Measures
anxious person will probably be nervous and disorganized under stress, we might write
items suggesting these conditions in order to measure anxiety. In the a priori method,
then, the scale writer collects or writes items that ostensibly measure personality traits.
This approach is essentially that of early personality test writers; it is a content validity
method that is still used. While there is nothing inherently wrong with the method
(indeed, it will have to be used, especially in the early stages of test and scale construction),
the results can be misleading. Items do not always measure what we think they measure.
Sometimes we even find that an item we thought would measure, say, social responsibil-
³Unfortunately, there are few texts on the construction of tests and scales for research purposes. The many
texts and other discussions of measurement focus for the most part on the construction and use of instruments for
applied purposes. Researchers who need to construct achievement measures of one kind or another, however,
will find excellent guidance in: D. Adkins, Test Construction: Development and Interpretation of Achievement
Tests, 2d ed. Columbus: Charles E. Merrill, 1974. Researchers who need to construct attitude scales will find
Edwards' book invaluable: A. Edwards, Techniques of Attitude Scale Construction. New York: Appleton-Century-Crofts, 1957.
ity, actually measures a tendency to agree with socially desirable statements. For this
reason, the a priori method, used alone, is insufficient.
The method of validation often used with a priori personality scales is the known-
group method. To validate a scale of social responsibility, one might find a group of
individuals known to be high in social responsibility, and another known to be low in
social responsibility. If the scale differentiates the groups successfully, it is said to have
validity.
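The known-groups logic is easy to express numerically. Here is a minimal sketch in Python, with hypothetical scale scores and an ordinary independent-samples t test standing in for whatever significance test a researcher might prefer; if the scale is valid, the group known to be high should score reliably higher.

    from math import sqrt
    from statistics import mean, stdev

    # Hypothetical total scores on a social-responsibility scale.
    high_group = [42, 39, 45, 41, 44, 40]   # individuals known to be high
    low_group = [28, 31, 25, 30, 27, 29]    # individuals known to be low

    def t_statistic(a, b):
        """Independent-samples t, equal variances assumed."""
        na, nb = len(a), len(b)
        pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
        return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))

    # A large t means the scale differentiates the known groups.
    print(round(t_statistic(high_group, low_group), 2))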
A priori personality and other measures will continue to be used in behavioral re-
search. Their blind and naive use, however, should be discouraged. Their construct and
criterion-related validities must be checked, especially through factor analysis and other
empirical means. Measures of personality, as well as other measures, have been used too
often merely because users think they measure whatever they are said to measure.
The construct or theoretical method of personality measure construction emphasizes
the relations of the variable being measured to other variables, the relations prompted by
the theory underlying the research. (See Chapter 27.) While scale construction must
always to some extent be a priori, the more personality measures are subjected to the tests
of construct validity the more faith we can have in them. It is not enough simply to accept
the validity of a personality scale, or even to accept its validity because it has successfully
Attitude Scales
Attitudes, while treated separately here and in most textbook discussions, are really an
integral part of personality. (Intelligence and aptitude, too, are considered parts of person-
ality by modern theorists.) Personality measurement, however, is mostly of traits. A trait
expresses itself across situations. If one is dominant, one exhibits dominant behavior in most situations. If one is
anxious, anxious behavior permeates most of one's activities. An attitude, on the other
hand, is an organized predisposition to think, feel, perceive, and behave toward a referent
or cognitive object. It is an enduring structure of beliefs that predisposes the individual to
behave selectively toward attitude referents.⁴ A referent is a category, class, or set of
phenomena: physical objects, events, behaviors, even constructs.⁵ People have attitudes
toward many different things: ethnic groups, institutions, religion, educational issues and
practices, the Supreme Court, civil rights, private property, and so on. One has, in other
words, an attitude toward something "out there." A trait has subjective reference; an
attitude has objective reference. One who has a hostile attitude toward foreigners may be
hostile only to foreigners, but one who has the trait hostility is hostile toward everyone (at
least potentially).
There are three major types of attitude scales: summated rating scales, equal-appearing
interval scales, and cumulative (or Guttman) scales. A summated rating scale (one
type of which is called a Likert-type scale) is a set of attitude items, all of which are
considered of approximately equal "attitude value," and to each of which subjects re-
"This definition comes from several sources: D. Krech and R. Crutchfield, Theory and Problems of Social
Psychology. New York: McGraw-Hill. 1948, p. 152; T. Newcomb. Social Psychology. New York; Holt.
Rine-
han and Winston. 1950, pp. 118-119; F. Kerlinger, '-Social Attitudes and Their Criterial Referents; A Struc-
Alliludes. and Values. San
tural Theory." Psychological Review. 74 (1967), 110-122; M. Rokeach. Belief.t.
Francisco; jossey-Bass, 1968. p. 112
^R. Brown, Words and Things. New York; Free Press, 1958, pp. 7-10.
spond with degrees of agreement or disagreement (intensity). The scores of the items of
such a scale are summed, or summed and averaged, to yield an individual's attitude score.
As in all attitude scales, the purpose of the summated rating scale is to place an individual
somewhere on an agreement continuum of the attitude in question.
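The arithmetic of summated rating scoring is simple enough to show in a few lines of Python. The responses below are hypothetical; in practice, negatively worded items would first be reverse-keyed (on a 7-point scale, 8 minus the response) before summing.

    # Hypothetical responses of three subjects to a four-item, 7-point scale.
    responses = {
        "subject_1": [7, 6, 7, 5],
        "subject_2": [4, 4, 3, 5],
        "subject_3": [1, 2, 1, 2],
    }

    for subject, items in responses.items():
        total = sum(items)              # summed score
        average = total / len(items)    # summed and averaged
        print(subject, total, round(average, 2))   # places each subject on the continuum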
It is important to note two or three characteristics of summated rating scales, since
many scales share these characteristics. First, U, the universe of items, is conceived to be
a set of items of equal "attitude value," as indicated in the definition given above. This
means that there is no scale of items, as such. One item is the same as any other item in
attitude value. The individuals responding to items are "scaled"; this "scaling" comes
about through the sums (or averages) of the individuals' responses. Any subset of U is
theoretically the same as any other subset of U: a set of individuals would be rank-ordered
the same using U₁ or U₂.
Second, summated rating scales allow for the intensity of attitude expression. Subjects
can agree or they can agree strongly. There are advantages to this, as well as disadvan-
tages. The main advantage is that greater variance results. When there are five or seven
possible categories of response, it is obvious that the response variance should be greater
than with only two or three categories (agree, disagree, no opinion, for example). The
variance of summated rating scales, unfortunately, often seems to contain response-set
variance. Individuals have differential tendencies to use certain types of responses: ex-
treme responses, neutral responses, agree responses, disagree responses. This response
variance confounds the attitude (and personality trait) variance. The individual differences
yielded by summated rating attitude scales (and similarly scored trait measures) have been
shown to be due in part to response set and other similar extraneous sources of variance.⁶
Here are two summated rating items from a scale constructed by Burt for her study of
attitudes toward rape. They were written to measure sex-role stereotyping. A seven-point
scale ranging from strongly agree (7) to strongly disagree (1) was used. The values in
parentheses (and the values in between) are assigned to the responses indicated.
There is something wrong with a woman who doesn't want to marry and raise a family.
A woman should be a virgin when she marries.⁷
Thurstone equal-appearing interval scales are built on different principles. While the
ultimate product, a set of attitude items, can be used for the same purpose of assigning
individuals attitude scores, equal-appearing interval scales also accomplish the important
purpose of scaling the attitude items. Each item is assigned a scale value, and the scale
value indicates the strength of attitude of an agreement response to the item. The universe
of items is considered to be an ordered set; that is, items differ in scale value. The scaling
procedure finds these scale values. In addition, the items of the final scale to be used are
so selected that the intervals between them are equal, an important and desirable psycho-
metric feature.
The following equal-appearing interval items, with the scale values of the items, are
from Thurstone and Chave's scale, Attitude toward the Church:⁸
""The response-set literature is large and cannot be cited in detail. Nunnally's discussion is well-balanced:
J.Nunnally. Psychometric Theory. 2d ed. New York: McGraw-Hill, 1978. pp. 655-672. I believe that, while
response set is a mild threat to valid measurement, its importance has been overrated and that the available
evidence does not justify the strong negative assertions made by response-set enthusiasts. In other words, while
one must be conscious of the possibilities and threats, one should certainly not be paralyzed by the somewhat
blown-up danger. See ibid., p. 672. and L. Rorer, "The Great Response-Style Myth," Psychological Bulleiin.
63 (1965). 129-156.
^M. Burt, "Cultural Myths and Supports for Rape," Journal of Per.mnality and Social Psychology. 38
(1980). 217-230 (p. 222).
"t^. Thurstone and E. Chave. The Measurement of Altitude. Chicago: University of Chicago Press. 1929.
I believe the church is the greatest institution in America today. (Scale value: 0.2)
I believe in religion, but I seldom go to church. (Scale value: 5.4)
I think the church is a hindrance to religion for it still depends upon magic, superstition, and
myth. (Scale value: 9.6)
In the Thurstone and Chave scale, the lower the scale value, the more positive the attitude
toward the church. The first and third items were the lowest and highest in the scale. The
second item, of course, had an intermediate value. The total scale contained 45 items with
scale values ranging over the whole continuum. Usually, however, equal-appearing inter-
val scales contain considerably fewer items.
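An individual's score on an equal-appearing interval scale is conventionally the median (or mean) of the scale values of the items he endorses. A minimal sketch in Python, using the three Thurstone-Chave items quoted above and a hypothetical pattern of endorsements:

    from statistics import median

    # Scale values of the three items quoted above (low value = positive attitude).
    scale_values = {
        "greatest institution": 0.2,
        "believe, seldom go": 5.4,
        "hindrance to religion": 9.6,
    }

    # Hypothetical endorsements: the items this respondent checks agreement with.
    endorsed = ["believe, seldom go", "hindrance to religion"]

    score = median(scale_values[item] for item in endorsed)
    print(score)   # 7.5: toward the negative end of the continuum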
The third type of scale, the cumulative or Guttman scale, consists of a relatively small
set of homogeneous items that are unidimensional (or supposed to be). A unidimensional
scale measures one variable, and one variable only. The scale gets its name from the
cumulative relation between items and the total scores of individuals. For example, we
ask four children three arithmetic questions: (a) 28/7 = ?, (b) 8 x 4 = ?, and (c) 12 +
9 = ? A child who gets (a) correct is very likely to get the other two correct. The child
who misses (a), but gets (b) correct, is likely also to get (c) correct. A child who misses
(c), on the other hand, is not likely to get (a) and (b) correct. The situation can be
summarized as follows (the table includes the score of the fourth child, who gets none
correct):

             (a)   (b)   (c)   Score
  Child 1     1     1     1      3
  Child 2     0     1     1      2
  Child 3     0     0     1      1
  Child 4     0     0     0      0
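The cumulative pattern can be checked numerically with the standard coefficient of reproducibility, 1 minus the proportion of responses that deviate from the ideal cumulative pattern. The Python sketch below applies it to the four patterns in the table; with data this small the exercise is purely illustrative.

    # Response patterns, items ordered hardest (a) to easiest (c); 1 = correct.
    children = [
        (1, 1, 1),   # score 3
        (0, 1, 1),   # score 2
        (0, 0, 1),   # score 1
        (0, 0, 0),   # score 0
    ]

    def ideal(score, n_items=3):
        """Perfect cumulative pattern: the `score` easiest items are correct."""
        return tuple(int(i >= n_items - score) for i in range(n_items))

    errors = sum(
        response != expected
        for pattern in children
        for response, expected in zip(pattern, ideal(sum(pattern)))
    )
    reproducibility = 1 - errors / (len(children) * 3)
    print(reproducibility)   # 1.0: a perfect Guttman scale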
Value Scales
Values are culturally weighted preferences for things, ideas, people, institutions, and
behaviors.¹¹ Whereas attitudes are organizations of beliefs about things "out there,"
predispositions to behave toward the objects or referents of attitudes, values express
preferences for modes of conduct and end-states of existence.¹² Words like equality,
religion, free enterprise, civil rights, and obedience express values. Simply put, values
express the "good," the "bad," the "shoulds," the "oughts" of human behavior. Val-
ues put ideas, things, and behaviors on approval-disapproval continua. They imply
choices among courses of action and thinking.
To give the reader some flavor of values, here are three items.¹³ Individuals can be
asked to express their approval or disapproval of the first and second items, perhaps in
summated rating form, and to choose from the three alternatives of the third item.
For his own good and for the good of society, man must be held in restraint by tradition and
authority.
Now more than ever we should strengthen the family, the natural stabilizer of society.
Which of the following is the most important in living the full life: education, achievement,
friendship?
Unfortunately, values have had little scientific study, even though they and attitudes
are a large part of our verbal output and are probably influential determinants of behavior.
The measurement of values has thus suffered. Social and educational values will
probably become the focus of much more theoretical and empirical work in the next
decade, however, since social scientists have become increasingly aware that values are
important influences on individual and group behavior.¹⁴
A number of objective measures do not conveniently fall into one of the above categories,
although they are closely related to one or more of them. We shall consider several of
these measures here to illustrate the variety of work already done and the possible nature
of future objective inferential measurement.
Certain scales are important because of their theoretical value and the frequency of
their use. The well-known F scale is one of these.¹⁵ Designed to measure authoritarianism
¹⁰Ibid., chaps. 7 and 9. Edwards describes how to construct and evaluate cumulative scales, as well as
summated rating and equal-appearing interval scales.
¹¹C. Kluckhohn et al., "Values and Value-Orientations in the Theory of Action." In T. Parsons and
E. Shils, eds., Toward a General Theory of Action. Cambridge, Mass.: Harvard University Press, 1952, pp.
388-433.
¹²Rokeach, op. cit., p. 159. The relations between attitudes and values and the definitions of both are still
not clear in the literature, perhaps because of neglect of value theory and research.
¹³The first two items were written by the author for a University of Hawaii research project, 1971.
¹⁴See W. Dukes, "Psychological Studies of Values," Psychological Bulletin, 52 (1955), 24-50; R. Hogan,
"Moral Conduct and Moral Character," Psychological Bulletin, 79 (1973), 217-232; S. Pittel and G. Men-
delssohn, "Measurement of Moral Values," Psychological Bulletin, 66 (1966), 22-35. A source of values
scales is T. Levitin, "Values." In J. Robinson and P. Shaver, eds., Measures of Psychological Attitudes. Ann
Arbor: Institute for Social Research, University of Michigan, 1969, chap. 7. A highly suggestive and valuable
essay appeared thirty years ago: Kluckhohn et al., op. cit. Thurstone's essay on values measurement is still
important: L. Thurstone, The Measurement of Values. Chicago: University of Chicago Press, 1959, chap. 17.
¹⁵T. Adorno et al., The Authoritarian Personality. New York: Harper & Row, 1950.
(F originally stood for fascism), it has been called both a personality and an attitude
measure. Probably closer to being a personality measure, the F scale is a summated rating
scale in which subjects are asked to respond to a number of general statements, usually 29
or 30. Here are three such items, agreement with which is supposed to indicate authoritar-
ian trends in the respondent.

Obedience and respect for authority are the most important virtues children should learn.
What the youth needs most is strict discipline, rugged determination, and the will to work and
fight for family and country.
Science has its place, but there are many important things that can never possibly be understood
by the human mind.¹⁶
The F scale seems to tap broad general attitudes or cores of values, as well as person-
ality traits. Many differences have been found between high- and low-scoring persons and
groups.¹⁷ More important, however, are the fruitful theoretical reasoning behind the
scale's construction, the empirical approach to an important social and psychological
problem, and the stimulus to research in the measurement of complex variables. The F
scale has fallen into some disrepute under the critical onslaught of psychologists and
sociologists because it allegedly does not hold up under the rigorous application of valid-
ity criteria. Evidence from a study by Kerlinger and Rokeach, however, seems to indicate
that, despite its weaknesses, the F scale was well conceived theoretically and well fash-
ioned as a measure of authoritarianism.¹⁸
The measurement of interests is relatively easy. The most important measures are the
Strong-Campbell and the Kuder inventories.¹⁹ The reliabilities of both scales are high; the
evidence for their validity seems good.
Naturally, there are many other important and useful tests and scales: measures of
moral judgment, social responsibility, classroom environment, needs, dominance, and so
on. They cannot be discussed here. A few interesting and promising ones, however, are
mentioned in Study Suggestion 2 at the end of this chapter. Before ending our formal
discussion of objective tests and scales, we should examine a new development of consid-
erable potential research importance: social indicators.
¹⁸F. Kerlinger and M. Rokeach, "The Factorial Nature of the F and D Scales," Journal of Personality and
Social Psychology, 4 (1966), 391-399. The data of this study were reanalyzed and the study replicated by other
researchers with essentially the same results: P. Warr, R. Lee, and K. Joreskog, "A Note on the Factorial Nature
of the F and D Scales," British Journal of Psychology, 60 (1969), 119-123.
¹⁹For descriptions of these scales, see Sax, op. cit., pp. 473-490.
²⁰R. Jaeger, "About Educational Indicators: Statistics on the Conditions and Trends in Education." In
L. Shulman, ed., Review of Research in Education, vol. 6. Itasca, Ill.: Peacock, 1978. This chapter is a highly
competent exposition on the definition, status, and use of social and educational indicators.
²¹E.g., Bureau of the Census, Social Indicators III. Washington, D.C.: Bureau of the Census, 1980; Bureau
of the Census, Social and Economic Characteristics of the Metropolitan and Nonmetropolitan Population: 1977
and 1970. Washington, D.C.: Bureau of the Census, 1978. The latter publication is a valuable source for
sociological and educational researchers, containing as it does comparative statistics on sex, race, education,
occupation, and income.
logical and social change, agriculture, mental illness, crime, investments, distribution of
wealth, education, and so on.²² Campbell has emphasized the need for subjective indica-
tors that assess satisfaction with life.²³
We hazard a definition: Social indicators are concepts and associated statistics that
reflect social conditions and human status and that under certain conditions can be used as
variables in behavioral research. Examples are numerous: occupation, religion, birth rate,
divorce, crime, voting, sex, race, unemployment, electricity consumption, alcohol con-
sumption, and so on. Strictly speaking, social indicators are used by action agencies to
assess social conditions and social change, to monitor the achievement of governmental
social goals, and to study human and social conditions in order to understand and improve
them. In this book, however, we stress their potential
use as research variables, or as
components of latent variables. For example, occupation, education, and income are
social indicators that can be combined in some manner to measure the latent variable,
"social class." Crime, unemployment, divorce, and other indicators can themselves be
used as variables or they can be combined to measure, say, the latent variable "social
unrest" or "social malaise." One can say that social indicators "indicate" or point to
desirable and undesirable societal conditions that imply desirable and undesirable social
trends. And this seems to be their chief meaning and use. They can also be profitably used
as observed components of underlying unobserved or latent variables.
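A minimal sketch of the composite idea in Python: the respondents' values are hypothetical, and the equal-weight averaging of z scores is only one of many ways (factor scores are another) of combining indicators into a "social class" index.

    from statistics import mean, stdev

    # Hypothetical indicator values for five respondents.
    occupation_prestige = [62, 45, 71, 38, 55]
    education_years = [16, 12, 18, 10, 14]
    income_thousands = [48, 27, 65, 22, 39]

    def z_scores(xs):
        m, s = mean(xs), stdev(xs)
        return [(x - m) / s for x in xs]

    # Equal-weight composite: each person's mean standardized indicator.
    ses_index = [
        mean(triple)
        for triple in zip(z_scores(occupation_prestige),
                          z_scores(education_years),
                          z_scores(income_thousands))
    ]
    print([round(v, 2) for v in ses_index])   # one "social class" score per respondent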
Social indicators, as the reader no doubt realizes by now, are usually objective indices
calculated from aggregate statistics, statistics obtained from large social, governmental,
or organizational units. Literacy, for example, might be reported as 95 percent literacy for
such-and-such a population, meaning that 95 percent of that population can read. The use
of so-called subjective indicators, however, will probably increase in the next decade. In
the Survey Research Center study of the quality of American life mentioned in Chapter
24, for example, respondents were asked rather directly their feelings of well-being (see
the questions on freedom and on satisfaction with life in the United States in Chapter 28).
In his essay on subjective measures of well-being (see footnote 23), Campbell calls for use
of such subjective measures, claiming that they go directly to life experience itself. In any
case, both objective and subjective social indicators will probably be used much more in
scientific behavioral research in the next decade, especially in multivariate research that
explores and tests theories and hypotheses on the complex relations among psychological,
sociological, and educational variables.
Educational indicators, similarly, should become increasingly important in scientific
multivariate research on educational phenomena. Educational indicators are social indica-
tors or variables that presumably reflect the state of education at national and local levels.
Of increasing interest are the relations among social and educational indicators themselves,
their influence on educational achievement, and their use in assessing the success of large-
scale educational programs. In a sense, large-scale studies like Equality of Educational
Opportunity depend for their successful prosecution on the intelligent use of both objec-
tive and subjective social and educational indicators.
²²R. Bauer, ed., Social Indicators. Cambridge, Mass.: MIT Press, 1966. This appears to be the pioneer
book on indicators.
²³A. Campbell, "Subjective Measures of Well-Being," American Psychologist, 31 (1976), 117-124.
sponse to an item is unrelated to his response to another item. True-false, yes-no, agree-
disagree, and Likert items belong to the independent type. The subject responds to each
item freely with a range of two or more possible responses from which he can choose one.
Nonindependent items, on the other hand, force the respondent to choose one item or
alternative that precludes the choice of other items or alternatives. These forms of scales
and items are called forced-choice scales and items. The subject is faced with two or more
items or subitems and is asked to choose one or more of them according to some criterion,
or even criteria.
Two simple examples will show the difference between independent and noninde-
pendent items. First, a set of instructions that allows independence of response might be
given to the respondent:
Indicate beside each of the following statements how much you approve of each, using a scale
from 1 through 5, 1 meaning "Do not approve at all" and 5 meaning "Approve very much."
A contrasting set of instructions, with more limited choices (nonindependent) might be:
Forty pairs of statements are given below. From each pair, choose the one you approve more.
Mark it with a check.
Advantages of independent items are economy and the applicability of most statistical
analyses to responses to them. Also, when each item is responded to, a maximum of
information is obtained, each item contributing to the variance. Less time is taken to
administer independent scales, too, but they may suffer from response-set bias. Individu-
als can give the same or similar response to each item: they can endorse them all enthu-
siastically or indifferently, depending on their particular response predilections. The
substantive variance of a variable, then, can be confounded by response set (but see foot-
note 6).
The forced-choice type of scale avoids, at least to some extent, response bias. At the
same time, though, it suffers from a lack of independence, a lack of economy, and
overcomplexity. Forced-choice scales can also strain the subject's endurance and pa-
tience, resulting in less cooperation. Still, many experts believe that forced-choice instru-
ments hold great promise for psychological and educational measurement. Other experts
are skeptical.
Scales and items, then, can be divided into three types: agreement-disagreement (or
approve-disapprove, or true-false, and the like), rank order, and forced choice. We dis-
cuss each of these briefly. Lengthier discussions can be found in the literature.²⁴
Agreement-Disagreement Items
²⁴Edwards, op. cit., and J. Guilford, Psychometric Methods, 2d ed. New York: McGraw-Hill, 1954.
to indicate those items that describe them, items with which they agree, or simply items
that they choose. The adjective check list is a good example. The subject is presented with
a list of adjectives, some indicating desirable traits, like thoughtful, generous, and consid-
erate; and others indicating undesirable traits, like cruel, selfish, and mean. They are
asked to check those adjectives that characterize them. (Of course, this type of instrument
can be used to characterize other persons, too.) A better form, perhaps, would be a list
with all positive adjectives of known scale values from which subjects are asked to select a
specific number of their own personal characteristics. The equal-appearing interval scale
and its response system of checking those attitude items with which one agrees is, of
course, the same idea. The idea is a useful one, especially with the development of factor
scales, scaling methods, and the increasing use of choice methods.
The scoring of agreement-disagreement types of items can be troublesome since not
all the items, or the components of the items, receive responses. (With a summated rating
scale or an ordinary rating scale subjects usually respond to all items.) In general, how-
ever, simple systems of assigning numerals to the various choices can be used. For
instance, "agree-disagree" can be 1 and 0; "yes-?-no" can be 1, 0, -1, or, avoiding
minus signs, 2, 1, 0. The responses to the summated rating items described earlier are
simply assigned 1 through 5 or 1 through 7.
The main thing researchers have to keep in mind is that the scoring system has to yield
interpretable data congruent with the scoring system. If scores of 1, 0, -1 are used, the
data must be capable of a scaled interpretation; that is, 1 is "high" or "most," -1 is
"low" or "least," and 0 is in between. A system of 1, 0 can mean high and low or simply
presence or absence of an attribute. Such a system can be useful and powerful, as we saw
earlier when discussing variables like sex, race, social class, and so on. In sum, the data
yielded by scoring systems have to have clearly interpretable meanings in some sort of
quantitative sense. The student is referred to Ghiselli's discussion of the meaningfulness
of scores.²⁵
Various systems for weighting items have been devised, but the evidence indicates
that weighted and unweighted scores give much the same results. Students seem to find it
hard to believe this. (Note that we are talking about the weighting of responses to items.)
Although the matter is not completely settled, the evidence is strong that, in tests and
measures of sufficient numbers of items, say 20 or more, weighting items differentially
does not make much difference in final outcomes. Nor does the different weighting of
responses make much difference.²⁶ It also makes no difference at all, in variance terms, if
you transform scoring weights linearly. You may have subjects use a system of +1, 0, -1,
and of course these scores can be used in analysis. But you can add a constant of 1 to each
score, yielding 2, 1, 0. The transformed scores are easier to work with, since they have no
minus signs.
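Both assertions are easily verified numerically. In the hypothetical Python sketch below, shifting the weights from -1, 0, +1 to 2, 1, 0 leaves the variance of the total scores untouched, and arbitrarily chosen differential item weights produce totals that correlate very highly with the unweighted totals.

    from statistics import mean, pvariance, stdev

    # Hypothetical responses of six subjects to five items scored -1, 0, +1.
    data = [
        [1, 1, 0, 1, -1],
        [0, 1, 1, 1, 0],
        [-1, 0, -1, 0, -1],
        [1, 0, 1, 1, 1],
        [0, -1, 0, -1, 0],
        [1, 1, 1, 0, 1],
    ]

    totals = [sum(row) for row in data]
    shifted = [sum(x + 1 for x in row) for row in data]   # weights 2, 1, 0 instead
    print(pvariance(totals), pvariance(shifted))          # identical variances

    def r(xs, ys):
        """Pearson correlation."""
        mx, my = mean(xs), mean(ys)
        covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        return covariance / ((len(xs) - 1) * stdev(xs) * stdev(ys))

    weights = [3, 1, 2, 1, 2]   # arbitrary illustrative weights
    weighted = [sum(w * x for w, x in zip(weights, row)) for row in data]
    print(round(r(totals, weighted), 3))   # very high: weighting changes little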
Rank-Order Items
The second group of scale and item types is ordinal or rank order, which is a simple and
most useful form of scale or item. A whole scale can be rank-ordered; that is, subjects
can be asked to rank all the items according to some specified criterion. We might wish to
compare the educational values of administrators, teachers, and parents, for instance. A
number of items presumed to measure educational values can be presented to the members
of each group with instructions to rank order them according to their preferences.
²⁵E. Ghiselli, Theory of Psychological Measurement. New York: McGraw-Hill, 1964, pp. 44-49.
²⁶See Guilford, op. cit., pp. 447ff.; Nunnally, op. cit., pp. 296-297.
Forced-Choice Items
The essence of a forced-choice method is that the subject must choose among alternatives
that on the surface appear about equally favorable (or unfavorable). Strictly speaking, the
method is not new. Pair comparisons and rank-order scales are forced-choice methods.
What is different about the forced-choice method, as such, is that the discrimination and
preference values of items are determined, and items approximately equal in both are
paired. In this way, response set and "item desirability" are to some extent controlled.
("Item desirability" means that one item may be chosen over another simply because it
expresses a commonly recognized desirable idea. If a person is asked if he is careless or
efficient, he is likely to say he is efficient, even though he is careless.)
The method of paired comparisons (or pair comparisons) has a long and respectable
psychometric past. It has, however, been used mostly for purposes of determining scale
values.²⁹ Here we look at paired comparisons as a method of measurement. The essence
of the method is that sets of pairs of stimuli, or items of different values on a single
continuum or on two different continua or factors, are presented to the subject with
instructions to choose one member of each pair on the basis of some stated criterion. The
criterion might be: which one better characterizes the subject, or which does the subject
prefer. The items of the pairs can be single words, phrases, sentences, or even paragraphs.
For example, in his Personal Preference Schedule, Edwards effectively paired statements
that expressed different needs.³⁰ One item measuring the need for autonomy, for instance,
is paired with another item measuring the need for change. The subject is asked to choose
one of these items. It is assumed that he will choose the item that fits his own needs. A
unique feature of the scale is that the social desirability values of the paired members were
determined empirically and the pairs matched accordingly. The instrument yields profiles
of need scores for each individual.
In some ways, the two types of paired-comparisons technique, (1) the determining of
scale values of stimuli, and (2) the direct measurement of variables, are the most satisfy-
ing of psychometric methods. They are simple and economical, because there are only
two alternatives. Further, a good deal of information can be obtained with a limited
amount of material. If, for example, an investigator has only 10 items, say 5 of Variable A
and 5 of Variable B, he can construct a scale of 5 x 5 or 25 items, since each A item can
²⁷E. Taieporos, "Motivational Patterns in Attitudes Towards the Women's Liberation Movement," Journal
of Personality, 45 (1977), 484-500.
²⁸Guilford, op. cit., chap. 8.
²⁹Ibid., chap. 7.
³⁰A. Edwards, Personal Preference Schedule, Manual. New York: Psychological Corp., 1953.
be systematically paired with each B item. (The scoring is simple: assign a "1" to A or
B in each item, depending on which alternative the subject chooses.) Most important,
paired-comparison items force the subjects to choose. Although this may irk some sub-
jects, especially if they believe that neither item represents what they would choose (that
is, choosing between coward and weakling to categorize oneself), it is really a customary
human activity. We must make choices every day of our lives. It can even be argued that
agreement-disagreement items are artificial and that choice items are "natural."³¹
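A minimal sketch in Python of the construction and scoring just described, with hypothetical adjective items: each of 5 Variable A items is paired with each of 5 Variable B items, yielding 25 forced choices, and a subject's A score is simply the number of times an A item is chosen.

    import random
    from itertools import product

    # Hypothetical items for two adjectival variables.
    a_items = ["sensitive", "responsive", "warm", "patient", "fair"]
    b_items = ["conscientious", "organized", "thorough", "prompt", "exact"]

    pairs = list(product(a_items, b_items))   # 5 x 5 = 25 forced-choice pairs
    print(len(pairs))

    # Simulate a subject who prefers A items about 70 percent of the time.
    random.seed(1)
    a_score = sum(random.random() < 0.7 for _ in pairs)
    print(a_score, 25 - a_score)   # A score and B score; they must sum to 25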
Forced-choice items of more than two parts can assume a number of forms with three,
four, or five parts, the parts being homogeneous or heterogeneous in favorableness or
unfavorableness. We discuss and illustrate only one of these types to demonstrate the
principles behind such items. By factor analysis, a procedure known as the critical inci-
dents technique, or some other method, items are gathered and selected. It is usually
found that some items discriminate between criterion groups and others do not. Both kinds
of items, call them discriminators and irrelevants, are included in each item set. In
addition, preference values are determined for each item.
A typical forced-choice item is a tetrad. One useful form of tetrad consists of two pairs
of items, one pair high in preference value, the other pair low in preference value, one
member of each pair being a discriminator (or valid), and the other member being irrele-
vant (or not valid). A scheme of such a forced-choice item is

  High preference value:  discriminator (valid) item    irrelevant (not valid) item
  Low preference value:   discriminator (valid) item    irrelevant (not valid) item
A subject is directed to choose the item of the tetrad that he most prefers, or that describes
him (or someone else) best, and so on. He is also directed to select the item that is least
preferred or least descriptive of himself.
The basic idea behind this rather complex type of item is, as indicated earlier, that
response set and social desirability are controlled. The subject cannot tell, theoretically at
least, which are the discriminator items and which the irrelevant items; nor can he pick
items on the basis of preference values. Thus the tendencies to evaluate oneself (or others)
too high or too low are counteracted, and validity is therefore presumably increased.³²
Here is a forced-choice item of a somewhat different type, constructed by the author
for illustrative purposes using items from actual research:

conscientious
agreeable
responsive
sensitive

One of the items (sensitive) is an A item, and one (conscientious) a B item. (A and B refer
to adjectival factors.) The other two items are presumably irrelevant. Subjects can be
asked to choose the one or two items that are most important for a teacher to have.
Forced-choice methods seem to have great promise. Yet there are technical and psy-
³¹In a study of Adler's concept of social interest (valuing things other than self), Crandall used pair compari-
sons to develop his Social Interest Scale: J. Crandall, "Adler's Concept of Social Interest: Theory, Measure-
ment, and Implications for Adjustment," Journal of Personality and Social Psychology, 39 (1980), 481-495.
Ninety traits were rated by judges for their relevance to social interest; 48 pairs were then used, one member of
each pair having relevance for social interest, the other member not having such relevance. Then, after item
analysis to determine the most discriminating items, a 15-item scale was developed. Unfortunately, Crandall
does not report the form of the scale. The idea, however, is a good one: he used the strength of paired
comparisons to find good items for a final scale.
³²For further discussion, see Guilford, op. cit., pp. 274ff.
indicating the "first," "highest," or "most," and 5 indicating "last," "lowest," and
"least," with 2, 3, and 4 indicating positions in-between. No matter who uses these
ranks, the sum and mean of the ranks is always the same, 15 and 3, and the standard
deviation is also always the same, 1.414. Ranks, then, are ipsative measures.
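The constancy is easy to verify: whatever order a subject assigns, the five ranks are always the numbers 1 through 5, so their sum, mean, and (population) standard deviation cannot vary. A quick Python sketch:

    from itertools import permutations
    from statistics import mean, pstdev

    # Any three of the 120 possible orderings of five objects.
    for ranks in list(permutations([1, 2, 3, 4, 5]))[:3]:
        print(ranks, sum(ranks), mean(ranks), round(pstdev(ranks), 3))
    # Always 15, 3, and 1.414: the ranks carry no between-person variance.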
If the values 1, 2, 3, 4, and 5 were available for use to rate, say, five objects, and four
people rated the five objects, we might obtain something like the following:
ber. This means lack of independence and negative correlation among items as a function
of the instrumental procedure. Most statistical tests, however, are based on the assump-
tion of independence of the elements entering statistical formulas. And analysis of corre-
lations, as in factor analysis, can be seriously distorted by the negative correlations.
Unfortunately, these limitations have not been understood, or they have been overlooked,
by investigators who, for example, have treated ipsative data normatively.³³
'""L. Hicks, "Some Properties of Ipsative, Normative, and Forced-Choice Normative Measures," Psycho-
logical Bulletin, 74 (1970), 167-184. The reader is encouraged to set up a small matri,\ of ipsative numbers
hypothetically generated by responses to a paired-comparisons scale. Use Is and O's and calculate the r's
between items over individuals.
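Taking up the footnote's suggestion, the sketch below generates hypothetical ipsative data (each subject's three item scores are a permutation of 2, 1, 0, as paired comparisons among three items would yield) and computes the inter-item correlations over individuals; the built-in negative correlations appear at once.

    import random
    from statistics import mean, stdev

    random.seed(2)

    # 50 hypothetical subjects; each row sums to the constant 3 (ipsative).
    subjects = []
    for _ in range(50):
        scores = [2, 1, 0]
        random.shuffle(scores)
        subjects.append(scores)

    def r(xs, ys):
        """Pearson correlation."""
        mx, my = mean(xs), mean(ys)
        covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        return covariance / ((len(xs) - 1) * stdev(xs) * stdev(ys))

    columns = list(zip(*subjects))   # item columns over individuals
    for i in range(3):
        for j in range(i + 1, 3):
            print(i, j, round(r(columns[i], columns[j]), 2))   # negative by construction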
tion of the complexity of measuring any personality and attitude variables. A second is the
technical advances made in doing so. Another closely allied development is the use of
factor analysis to help identify variables and to guide the construction of measures. A third
development (discussed in Chapter 27) is the increasing knowledge, understanding, and
mastery of the validity problem itself, and especially the realization that validity and
psychological theory are intertwined.
ADDENDUM
Criterion-Referenced Tests, Latent-Trait Theory, and Controversial Issues in
Testing
There are two or three important developments in testing that students of research should
know about. We do not elaborate them in this book because they are almost exclusively
concerned with the assessment of educational achievement and mental ability. Neverthe-
less, behavioral researchers have to be aware of them and their importance. We confine
this addendum to a few general remarks and recommended references for further study.
Latent trait theory is too technical to characterize briefly and adequately. It will probably
affect educational and psychological research and thus needs at least to be mentioned. The
more advanced student will find the following chapters helpful:
Subkoviak, M., and Baker, F. "Test Theory." In L. Shulman, ed., Review of Research in
Education, vol. 5. Itasca, Ill.: Peacock, 1977, chap. 7. More technical but clear discus-
sions of criterion-referenced testing (pp. 277-294), test bias (pp. 294-299), and latent trait
theory (pp. 299-310).
Traub, R., and Wolfe, R. "Latent Trait Theories and the Assessment of Educational Achieve-
ment." In D. Berliner, ed., Review of Research in Education, vol. 9. Washington, D.C.:
American Educational Research Association, 1981, chap. 8.
Test theory and practice have become highly controversial, especially the testing of men-
tal ability. The student is directed to the test bias section of the Subkoviak and Baker
chapter, above, and to the following references:
Berk, R., ed. Handbook of Methods for Detecting Test Bias. Baltimore: Johns Hopkins Univer-
sity Press, 1982.
Block, N., and Dworkin, G., eds. The IQ Controversy. New York: Pantheon, 1976. Valuable
readings by leaders in the controversy.
Loehlin, J., Lindzey, G., and Spuhler, J. Race Differences in Intelligence. San Francisco:
Freeman, 1975. Balanced, dispassionate treatment of a highly charged and difficult sub-
ject.
Professor A. Jensen, a central figure in the controversy, has written an important book on
bias in mental testing.³⁵ Reviews by Bouchard and by Bond of Jensen's book, published
in Applied Psychological Measurement, 4 (1980), 403-410, are highly recommended
because they succinctly and expertly present the basic issues and problems of the contro-
versy.
Study Suggestions
1. Here are a few references that may help students find their way in the large, difficult, but
³⁴The reader will constantly encounter the expression "IQ," which is used as a virtual equivalent of
"intelligence." The IQ, or intelligence quotient, was defined as mental age as determined by an intelligence test
divided by the testee's chronological age, multiplied by 100. This quotient has fortunately been abandoned.
Forms of standard scores are now used. Although the standard scores are not IQs, nor is "intelligence" equiva-
lent to "IQ," the use of "IQ" by the popular press and even by social scientists seems fixed and ineradicable.
³⁵A. Jensen, Bias in Mental Testing. New York: Free Press, 1980.
2. To gain insight into the rationale and construction of psychological measuring instruments, it
is helpful to study relatively complete accounts of their development. The following references
describe the development of interesting and important measurement instruments and items.
Allport, G., Vernon, P., and Lindzey, G. Study of Values, rev. ed. Manual of Directions.
Boston: Houghton Mifflin, 1951.
Edwards, A. Personal Preference Schedule, Manual. New York: Psychological Corp., 1953.
Measures needs in a forced-choice format (pair comparisons).
Likert, R. "A Technique for the Measurement of Attitudes," Archives of Psychology, no.
140, 1932. Likert's original monograph describing his technique, an important landmark
in attitude measurement.
Pace, C. CUES: College and University Environment Scales. Technical Manual, 2d ed. Prince-
ton, N.J.: Educational Testing Service, 1969. A well-developed set of five scales to mea-
sure aspects of college environment as seen by students. An especially good example of
careful and competent scale development.
Thurstone, L., and Chave, E. The Measurement of Attitude. Chicago: University of Chicago
Press, 1929. This classic describes the construction of the equal-appearing interval scale to
measure attitudes toward the church.
Verba, S., and Nie, N. Participation in America: Political Democracy and Social Equality.
New York: Harper & Row, 1972. See Appendix C, pp. 365-366, for the measure of
socioeconomic status used in the study, and Appendix B, pp. 355-357, for the standard
measure of participation used in the study.
Woodmansee, J., and Cook, S. "Dimensions of Verbal Racial Attitudes: Their Identification
and Measurement," Journal of Personality and Social Psychology, 7 (1967), 240-250.
Probably the best measure of attitudes toward blacks. The inventory is given in the Robin-
son, Rusk, and Head volume cited in Study Suggestion 3, below.
3. Here are three useful anthologies of attitude, value, and other scales. Their usefulness in-
heres not only in the many scales they contain, but also in perspicacious critiques that focus on
reliability, validity, and other characteristics of the scales.

Robinson, J., Rusk, J., and Head, K. Measures of Political Attitudes. Ann Arbor: Institute for
Social Research, University of Michigan.
Projective Methods,
and Content Analysis
In this chapter we examine personal and societal products as sources of research data.
Personal and societal products are materials, especially verbal materials, produced in the
course of living by individuals and groups. The production of materials is also deliberately
stimulated by scientists to provide measures of variables. This is the use of available
materials, or materials whose creation has been stimulated for scientific purposes, to
measure variables of research interest. The method to obtain variable measures from the
materials is content analysis. If we study documents already written in order to measure,
say, conceptual complexity, we are using available materials. The measures of conceptual
complexity (for each document, perhaps) are obtained from the documents by specifying
and following operational rules that tell us, in effect, whether a document or a part of it
is conceptually complex. Or the rules may even specify the assignment of point-scale
numbers to parts of documents. Such rules and procedures are the stuff of content analy-
sis. On the other hand, we may ask children to write stories on given subjects in order to
measure a variable or variables of interest. This is material whose production is deliber-
ately stimulated for our purpose, and content analysis is similarly applied to the material.
The method can be called projective. We will see what this means shortly. Our purpose,
as always, is to understand methods and their use in behavioral research. It is not to master
the methods. We do little more, in other words, than open up possibilities for the student
of research, whose task now is essentially to grasp and comprehend the nature of the
complex methods we study rather than to master the methods and their use.¹
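As a bare illustration of what such operational rules can look like, here is a sketch in Python; the rule (indicator words per hundred words) and the word list are invented for the example and are not a validated content-analytic scheme.

    # A made-up operational rule: conceptual-complexity score = occurrences
    # of "abstract relational" indicator words per 100 words of the document.
    INDICATORS = {"however", "therefore", "although", "implies", "consequently"}

    def complexity_score(document):
        words = document.lower().split()
        hits = sum(word.strip(".,;:") in INDICATORS for word in words)
        return 100 * hits / len(words)

    doc = ("Although the evidence is mixed, the theory implies that stress, "
           "however defined, lowers performance; therefore the hypothesis stands.")
    print(round(complexity_score(doc), 1))   # indicator words per 100 words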
¹In the first and second editions of this book, the methods of this chapter and the following two chapters
were discussed in considerably greater detail than in this edition. Detailed explanations are no longer possible
because the methods have developed beyond the point of relatively brief description and explanation. Research-
ers who want to use content analysis, for instance, will have to give it special study.
AVAILABLE MATERIALS
colleges, tested implicit hypotheses on the relation between lack of freedom and other
variables.² Much of Beale's source data came from available materials: newspapers,
periodicals, books, public documents, court decisions, and so on.
Although beset with methodological difficulties, available materials can be deliber-
ately used to test hypotheses. Tetlock, for example, tested Janis' groupthink hypothesis by
analyzing the public statements of makers of American foreign policy.³ He used content
analysis, which we examine later, to identify in policy makers' public statements tenden-
cies to treat policy-relevant information in biased ways and tendencies to evaluate one's
own group members positively and one's opponents negatively. The statements were
obtained from various archives. A more mundane use of available materials is to study
trends and changes in social phenomena by examining governmental statistics.⁴
Another use of available materials to check on research findings was mentioned earlier
when survey research was discussed: the use of census data to check sample data. If one
has drawn a random sample of dwellings in a community in order to interview individuals,
the accuracy of one's sample should be checked by comparing sociological data of the
sample, like income, race, and education, with the same data of the most recent census or
with available data in local government offices.
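A minimal sketch of such a check, with hypothetical figures: the sample's education distribution is compared with census percentages by a chi-square goodness-of-fit statistic, and a large value warns that the sample may not represent the community.

    # Hypothetical census proportions and sample counts for one variable.
    census_proportions = {"no diploma": 0.25, "high school": 0.45, "college": 0.30}
    sample_counts = {"no diploma": 40, "high school": 95, "college": 65}   # n = 200

    n = sum(sample_counts.values())
    chi_square = sum(
        (sample_counts[k] - n * p) ** 2 / (n * p)   # (observed - expected)^2 / expected
        for k, p in census_proportions.items()
    )
    print(round(chi_square, 2))   # compare with the chi-square critical value, df = 2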
²H. Beale, Are American Teachers Free? New York: Scribner, 1936.
³I. Janis, Victims of Groupthink. Boston: Houghton Mifflin, 1972; P. Tetlock, "Identifying Victims of
Groupthink from Public Statements of Decision Makers," Journal of Personality and Social Psychology, 37
(1979), 1314-1324. The essence of the hypothesis is that the intense social pressures toward uniformity in
decision-making groups interfere with cognitive efficiency and moral judgment. "Groupthink" occurs when
critical analysis takes second place to members' motivations to maintain group solidarity and to avoid
disunity.
⁴One of many examples is: Bureau of the Census, Social and Economic Characteristics of the Metropolitan
and Nonmetropolitan Population: 1977 and 1970. Washington, D.C.: U.S. Dept. of Commerce, Bureau of the
Census, 1978. This is a good source of changes in the relations between sex and race and certain population
variables like occupation and income.
Census and other official data (voting lists, housing registration, license registration,
school censuses, and so on) are also used to help draw samples. To draw a random
sample of a large geographical area is an expensive and difficult job. But it is not too
difficult to draw random samples of single, smaller communities. Some school systems,
for example, maintain relatively complete and accurate lists of taxpayers or families with
school children. The point is that there are a number of available sources that can be used
for drawing samples. Though none of these sources is perfect, since records are kept for
different purposes, with different degrees of accuracy, by people of different levels of
competence, they are better sources of samples than the informed hunches of investigators.
Five or six kinds of available materials seem to be most important for research purposes.
Probably the first place to search and make inquiries is the university library. Modern
librarians are trained professionals in unearthing materials for study. Use them! Census
and registration data are often invaluable (see footnote 4), especially for large field
studies and survey research. Check with a university library, and write the Superintendent
of Documents, U.S. Government Printing Office, Washington, D.C. 20402, and the
Bureau of the Census, Washington, D.C. 20233.
Extremely valuable sources of actual data have been stored in data archives for the use
of researchers. Again, ask university librarians. There are guides available.⁵ Mental
achievement test scores of school pupils all over the country have been filed in data bases
(see footnote 5). The important Human Relations Area Files contain voluminous data on
many contemporary and even ancient societies.⁶
Other important sources of information and data are county clerks' offices (registra-
tion data, for example), local school districts, state education departments, and the U.S.
Department of Education (Washington, D.C. 20202). Data on foreign countries can ordi-
narily be obtained from the consulates and embassies of the countries. (Write the appropri-
ate embassy in Washington, D.C., for information.) Useful foreign educational and other
information can be obtained from UNESCO, United Nations, New York, N.Y. 10017.
Newspaper files and personnel, school records, and personal documents like letters and
diaries are also useful sources of information and data.⁷ There are, then, many sources of
research information. The question is how to find them and use them. The above sugges-
tions may help.⁸
⁵See: L. Conger, "Data Reference Work with Machine Readable Data Files in the Social Sciences,"
Journal of Academic Librarianship, 2 (1976), 60-65; Directory of Online Databases. Cuadra Associates, 1983;
V. Sessions, ed., Directory of Data Bases in the Social and Behavioral Sciences. New York: Science Associ-
ates/International, 1974; T. Li, Social Science Reference Sources: A Practical Guide. Westport, Conn.: Green-
wood Press, 1980; E. Sheehy, ed., Guide to Reference Books, 9th ed. Chicago: American Library Association,
1976 (a guide to guides).
⁶See R. Lagace, Nature and Use of HRAF Files: A Research and Teaching Guide. New Haven: HRAF,
Inc., 1974; G. Murdock et al., Outline of Cultural Materials, 4th ed. New Haven: HRAF, Inc., 1971. Such files
are more and more used in social scientific research, especially in conjunction with computer retrieval and
analysis.
⁷G. Allport, The Use of Personal Documents in Psychological Science. New York: Social Science Research
Council, 1942.
⁸An excellent guide to available materials is: R. Angell and R. Freedman, "The Use of Documents,
Records, Census Materials and Indices." In L. Festinger and D. Katz, eds., Research Methods in the Behavioral
Sciences. New York: Holt, Rinehart and Winston, 1953, chap. 7. Although old, this is one of the only references
on the use of available materials in behavioral research. More recent references on data archives are: L. Schoen-
feldt, "Data Archives as Resources for Research, Instruction, and Policy Planning," American Psychologist, 25
(1970), 609-616; F. Bryant and P. Wortman, "Secondary Analysis: The Case for Data Archives," American
Psychologist, 33 (1978), 381-387.
PROJECTIVE METHODS
We project some part of ourselves into everything we do. Watch a man walk. Examine an
artist's drawings. Study a professor's lecture style. Observe a child play with other chil-
dren or with toys and dolls. In all these ways human beings express their needs, their
drives, their styles of life. If we want to know about people, then we can study what they
do and the way they do it.
People also put part of themselves, their work, their attitudes, and their culture in the
materials they create and store. Letters, books, historical records, art objects, artifacts of
all kinds express, if indirectly and often remotely, life, society, and culture.
Values, attitudes, needs, and wishes, as well as impulses and motives, are projected upon
objects and behaviors outside the individual. A hungry individual may invest inedible
objects with food properties. An individual with conservative social attitudes may see
federal taxes as confiscatory. Each person, then, views the world through his own projec-
tive glasses.
It should be possible to study men's motives, emotions, values, attitudes, and needs
by somehow getting them to project these internal states onto external objects. This potent
idea is behind projective devices of all kinds. A basic principle is that the more unstruc-
tured and ambiguous a stimulus, the more a subject can and will project emotions, needs,
motives, attitudes, and values. The structure of a stimulus, at least from one important
point of view, is the degree of choice available to the subject. A highly structured stimulus
leaves very little choice: the subject has unambiguous choice among clear alternatives, as
in an objective-type achievement test question. A stimulus of low structure has a wide
range of alternative choices. It is ambiguous: the subject can "choose" his own interpre-
tation.
Another important characteristic of projective methods is their relative lack of objec-
tivity, in the sense that it is much easier for different observers to come to different
conclusions about the responses of the same persons. Recall that one of the powerful
advantages of objective methods was that different observers agree on the scoring of
responses. Projectives, on the other hand, are used precisely because they lack this desira-
ble characteristic. Although different observers can score the same data quite differently,
a serious weakness from the perspective of objectivity, this is a strength from the projec-
tion perspective. All tests and measures involve inference, as we have seen. Projective
tests and measures require large inferential leaps, indeed larger than those of other meth-
ods. Thus their reliability and validity are difficult problems.
A significant virtue of projection and projective methods is that almost anything can
be used as a stimulus. In addition to the well-known projective tests, like the Rorschach
and the TAT (which will not be stressed in this chapter), the principle of projection can
be used in many other ways: drawing pictures, writing essays, using finger paints, playing
with dolls and toys, role playing, handwriting, telling stories in response to vague stimuli,
associating words to other words, interpreting music.
Projective devices are among the most imaginative and significant creations of
psychology. There is little doubt of their power, flexibility, and catholicity. But, and
the "but" is a large one, can they be used in scientific research? Are their
rather shaky reliability and validity inherent obstacles to their profitable use in re-
search?
Association Techniques
These techniques require the subject to respond, at the presentation of a stimulus, with the
first thing that comes to mind. The most famous device of this kind is the Rorschach test.
Construction Techniques
Here the focus is on the product of the subject. The subject is required to produce, to
construct, something at direction, usually a story or a picture. The stimulus can be simple,
like asking children to tell a story about what happened to them yesterday, or complex,
like the well-known Thematic Apperception Test (TAT). Generally some sort of standard
stimulus is used.
Perhaps the most highly developed use of the construction technique has been the
study and measurement of achievement motivation (need achievement, or n ach) by
McClelland and his colleagues.¹¹ Subjects were shown four pictures that could be inter-
preted in a variety of ways. One of these, for example, was a boy sitting before a book,
leaning on his left hand, and looking off into space. (Notice the ambiguity and my
interpretation; he may be looking at something specific.) They were asked to write, in
⁹G. Lindzey, "On the Classification of Projective Techniques," Psychological Bulletin, 56 (1959), 158-
168. Choice or ordering techniques (see below) are probably not true projective methods. Rather, they seem to
be a means of objectifying projective devices. For example, an objective rating scale can be used with the TAT
(Thematic Apperception Test), a test in which vague pictures are used as stimuli. The approach taken in this
book and in this chapter, that the various devices and methods are for the purpose of measuring variables, seems
not to be stressed in the literature. For an attempt to discuss the use of projectives in research, see J. Zubin, L.
Eron, and F. Schumer, An Experimental Approach to Projective Techniques. New York: Wiley, 1965. For a
thorough review of projective devices, including history, see H. Sargent, "Projective Methods: Their Origins,
Theory, and Application in Personality Research," Psychological Bulletin, 62 (1945), 257-293.
¹⁰J. Getzels and P. Jackson, Creativity and Intelligence. New York: Wiley, 1962, pp. 199-200, 224-225.
¹¹D. McClelland et al., The Achievement Motive. New York: Appleton, 1953.
about a minute, stories about the pictures. The stories were then scored for achievement imagery and motivation.¹⁰ Other variables were then correlated with n achievement.
In a study of religious beliefs, Cline and Richards used TAT-type pictures with religious overtones and had their subjects tell stories about what was happening, what the people were thinking and feeling, and what would be the outcome.¹³ The stories were scored for overall religious commitment, religious conflict, and so on. The average correlation between the projective measures of religious commitment and other measures of such commitment obtained from depth interviews and questionnaire items was .66, quite high for such measures.
Veldman and Menaker invented a remarkably simple yet effective projective device of the construction kind.¹⁴ They simply told their subjects, teacher trainees: "Tell four fictional stories about teachers and their experiences." Judges who read the stories of different samples agreed on six general areas of content: Structural Features, Interest Qualities, Emotional Features, Characters and Activity, Role Identity, and Self-Ability and Competence. Within these areas were subareas, like coherence, realism, and general adjustment. A cluster analysis (a method that determines which variables go together, or are correlated with each other) revealed three basic clusters, which corresponded to professional aspects of teaching, problem-solving, and affective aspects of the stories.
Completion Techniques
These methods require simple responses: the subject chooses from among several alternatives, as in a multiple-choice test item, the item or choice that appears most relevant, correct, attractive, and so on. One might wish to measure need for achievement or attitudes toward blacks, for example, and present subjects with sets of pictures or sets of statements from which to choose.
. . . details are given. Moreover, the work has been carried over into other areas. McClelland has even raised the achievement motive of Indian businessmen by teaching them to score stories for n achievement: D. McClelland, "Toward a Theory of Motive Acquisition," American Psychologist, 20 (1965), 321-333. In addition, McClelland has used available materials to test his ideas cross-culturally: D. McClelland, The Achieving Society. New . . .
Expressive Techniques
Expressive projective techniques are similar to construction techniques: subjects are required to form some sort of product out of raw material. But the emphasis is on the manner in which they do this; the end product is not important. With construction methods, the content, and perhaps the style, of the story or other product are analyzed. With, say, finger painting or play therapy, it is the process of the activity and not the end product that is important. Subjects express their needs, desires, emotions, and motives through working with, manipulating, and interacting with materials, including other people, in a manner or style that uniquely expresses personality.
The principal expressive methods are play, drawing or painting, and role playing. There are, however, other possibilities: working with clay, handwriting, games, and so on. (I suspect that in the near future the expressive possibilities of microcomputers will be used.) The discussion that follows is limited to play techniques and role playing.
In the research use of play techniques, a child is brought into the presence of a variety of toys, often dolls of some kind.¹⁶ He may be told that a set of dolls is a family and that he should play with them and tell a story about them. In their study of the effect of the presence of mothers on the aggressiveness of children, Levin and Turgeon used doll play to measure aggression.¹⁷ Aggression was defined as acts that hurt, irritate, injure, punish, frustrate, or destroy the dolls or equipment. Two scores of aggression were used: total number of aggressive units per session and percent aggression (number of aggressive units divided by total number of units).
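The arithmetic of these two scores is easily made concrete. Here is a minimal sketch in Python; the session codes are hypothetical, invented only to illustrate the two scores:

# Hypothetical doll-play session: each behavioral unit coded "A" (aggressive) or "N" (nonaggressive).
units = ["A", "N", "N", "A", "A", "N", "N", "N", "A", "N"]

total_aggressive = units.count("A")                          # total aggressive units per session
percent_aggression = 100.0 * total_aggressive / len(units)   # aggressive units / total units

print(total_aggressive, percent_aggression)                  # 4 and 40.0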
Role playing is the acting-out of a personal or social situation for a brief period by two or more individuals who have been assigned specific roles. It holds considerable promise as an experimental method and as an observational tool of behavioral research, though its research use has been limited. The investigator uses an observation system (see Chapter 31) to measure variables.¹⁸ Or experimental variables can be manipulated using the technique. Group processes and interpersonal interaction, especially, can be conveniently studied. Hostility, prejudice, and many other variables can be measured. It has been the experience of role players that they say things they would rarely say under ordinary circumstances. They "come out" with things that surprise even themselves. The method, in other words, tends to bring out motives, needs, and attitudes that are below the social surface. While potent for bringing out emotions and needs, however, it must be recognized that role playing is perhaps an approach more suited to therapeutic and teaching situations than it is to empirical research.¹⁹ The reason is that it is difficult to control role-playing situations so that research variables can be reliably and validly measured. In . . .
""H. Levin and E. Wardwell, "The Research Uses of Doll Play," Psvchological Bulletin. 59 (1962),
27-56.
'^H. Levin and V. Turgeon. "The Influence of the Mother's Presence on Children's Doll Play Aggression,"
Journal of Abnormal and Social Psychology. 55 (1957), 304-308.
"J. Mann, "Experimental Evaluations of Role Playing." Psychological Bulletin. 53 (1956), 227-234.
"For a skeptical view of role playing as a research method, see C. Spencer, "Two Types of Role Playing:
Threats to Internal and External Validity." American Psychologist. 33 (1978), 265-268.
VIGNETTES
Various ingenious methods have been invented to measure variables in realistic but unobtrusive ways.²¹ One of these methods, called vignettes, has considerable promise in behavioral research. Vignettes are brief concrete descriptions of realistic situations so constructed that responses to them, in the form of rating scales, say, will yield measures of variables. As usual, examples will perhaps be more enlightening than formal definition.
Pedhazur, for example, presented subjects with brief classroom episodes, vignettes that seemed democratic but that actually depicted phony "democratic" behaviors.²² The instrument elicited the responses he predicted it would.
Alexander and Becker, following factorial design principles, constructed systematically modified descriptions of rape crimes in vignettes.²³ What this means is that the vignettes were so constructed that the sex, age, race, marital status, and so on of the rape victims were systematically varied to assess people's reactions to rape crimes. (The assumption was that respondents' assignment of responsibility to victims and assailants would vary with the descriptions of the victims.)
-"I. Steiner and W. Field. "Role Assignment and Interpersonal Influence," 7ourna/o/>lb«orma/ am/ 5oaa/
Pswhologw 61 (1960), 239-245.
Chicago: Rand
-'
See E. Webb et al., Unobtrushe Measures: Nonreaaive Research in the Social Sciences.
McNallv, 1966, pp. 75ff.
--E. Pedhazur. 'Pseudoprogressivism and Assessment of Teacher Behavior,"
Educational and Psychologi-
In her study of students' needs and ratings of teachers, Tetenbaum constructed and used vignette portrayals of twelve teachers.²⁴ The vignettes contained descriptions of teachers whose orientations were directed toward intellectual challenge and sustained effort in knowledge acquisition, toward being supportive of students and facilitative of interpersonal relationships, and toward the encouragement of competitiveness and assertive leadership of their students. Tetenbaum predicted that student needs for cognitive structure and order, achievement, affiliation, and so on would be related to the different kinds of teachers portrayed in the vignettes; for example, students who had strong needs for affiliation would rate highly those teachers whose orientation was supportive of students.
It should be obvious that vignettes are a combination of expressive, objective, and projective ideas and methods. As such, they can be expected to be increasingly used in psychological and educational research because, constructed with imagination and ingenuity, they can be interesting to subjects, can measure complex variables, and can be good approximations to realistic psychological and social situations. They can also be unobtrusive approaches to sensitive information about the subjects (e.g., prejudiced attitudes, needs, and sexual preferences).
. . . independent and competent judges must agree on the scoring and interpretation of the data yielded by an observation method. How can this be done with projective methods? Suppose an investigator is trying to measure creativity. She shows children a picture and asks them to write a story about it. After the stories are written, she asks judges, whom she has already trained, to read the stories. In addition, she constructs a graphic seven-point rating scale (or a numerical rating scale) of five items. Each item epitomizes a criterion of creativity as she defines it. Say that two of them are originality and unusual approach. She now asks her judges to rate the stories of all the children using the rating scale. To the extent that the ratings correlate highly, to this extent she has achieved objectivity. Other objective procedures can similarly be applied to the products of projective tests.²⁵
²⁴T. Tetenbaum, "The Role of Student Needs and Teacher Orientations in Student Ratings of Teachers," American Educational Research Journal, 12 (1975), 417-433.
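The degree of objectivity achieved can itself be computed. Here is a minimal sketch in Python (3.10 or later, for statistics.correlation); the three judges and their seven-point ratings are hypothetical, and the Pearson r between pairs of judges is only one of several possible agreement indices:

import itertools
import statistics

# Hypothetical seven-point creativity ratings of ten stories by three trained judges.
ratings = {
    "judge_1": [6, 5, 2, 7, 4, 3, 5, 6, 2, 4],
    "judge_2": [7, 5, 3, 6, 4, 2, 5, 7, 1, 4],
    "judge_3": [6, 4, 2, 7, 5, 3, 6, 6, 2, 5],
}

# Correlate every pair of judges; to the extent the correlations are high,
# the investigator has achieved objectivity.
for (name_a, a), (name_b, b) in itertools.combinations(ratings.items(), 2):
    print(name_a, "vs", name_b, "r =", round(statistics.correlation(a, b), 2))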
To sum up, projective methods of observation, when considered as psychometric
instruments and subjected to the same canons and criteria of scientific measurement as
other instruments, and when used with circumspection and care, can be useful tools of
behavioral research. A projective instrument should not be used, however, if you have a
more objective instrument that adequately measures the same variable. Moreover, it is
wise to avoid complex projective techniques, like the Rorschach and the TAT, which
require highly specialized training and a good deal of perhaps questionable interpretation.
(But note the outstanding use of the TAT by McClelland and his colleagues to measure n
achievement and other needs.)
CONTENT ANALYSIS²⁶
²⁵In Chapter 27 we cited Amabile's measurement of creativity using what she called a consensual assessment method: T. Amabile, "Social Psychology of Creativity: A Consensual Assessment Technique," Journal of Personality and Social Psychology, 43 (1982), 997-1013. Her method is a good example of "objectifying the subjective." It first uses a global subjective approach that instructs judges to assess the creativity of products (collages, for example; a collage, recall, is an artistic composition of materials pasted on a surface of some kind) using their own criteria of what is creative. The judges must have had experience with the products being judged. After showing that the procedure was reliable, Amabile analyzed the judgments to determine which objective features of the products predicted the creativity judgments. These features, then, can be used in future measurement and research. Note that somewhere along the line objective procedures are used; for example, Amabile had the judges rank and rate the collages and analyzed the results quantitatively.
²⁶The discussion of this section leans heavily on the older treatments of Berelson and Holsti: B. Berelson, "Content Analysis." In G. Lindzey, ed., Handbook of Social Psychology. Reading, Mass.: Addison-Wesley, 1954, vol. I, chap. 13; O. Holsti, "Content Analysis." In G. Lindzey and E. Aronson, eds., The Handbook of Social Psychology, 2d ed. Reading, Mass.: Addison-Wesley, 1968, vol. II, chap. 16. A more recent treatment of content analysis that emphasizes its use as a method of the measurement of variables is: J. Markoff, G. Shapiro, and S. Weitman, "Toward the Integration of Content Analysis and General Methodology." In D. Heise, ed., Sociological Methodology 1975. San Francisco: Jossey-Bass, 1974, chap. 1. Holsti has more recently published a book-length treatment: O. Holsti, Content Analysis for the Social Sciences and Humanities. Reading, Mass.: Addison-Wesley, 1969.
One of the most important characteristics of content analysis is its general applicability, especially now that the use and availability of computers make its application much easier than it used to be. It can be used with the productions of projective methods, with materials deliberately produced for research purposes, and with all kinds of verbal materials. We examine two or three examples from actual research to suggest some of these uses.
Earlier, we cited Suedfeld and Rank's remarkable study of the success of revolutionary leaders.²⁸ Their hypothesis was that successful revolutionaries were characterized by a conceptually simple level of functioning before revolution, but that after takeover of the government they exhibited greater conceptual complexity. Unsuccessful revolutionary leaders, that is, those leaders who failed in governing after the revolution but who had successfully achieved the revolution, were conceptually simple before the revolution and remained conceptually simple after the revolution. Success, the dependent variable, was suitably defined and measured: successful leaders were those individuals who were prominent before the revolution and who held important posts after the revolution until the end of their constitutional terms followed by voluntary retirement, or until natural death. Unsuccessful leaders were those individuals who did not satisfy these requirements. There were nineteen leaders of five countries, of whom eleven were successful and eight were unsuccessful. Among the former were Oliver Cromwell, George Washington, Stalin, and Mao Tse-tung. Among the latter were Alexander Hamilton, Trotsky, and Guevara.
Measures of conceptual complexity for each leader were obtained by content analysis of letters, speeches, and documents using the Paragraph Completion Test.²⁹ The analysis of pre- and postrevolutionary complexity scores supported the hypothesis. The important point in the present context is that the variable conceptual complexity was measured through content analysis of the communications of revolutionary leaders.
In another productive use of both a projective device and content analysis, Veroff and his colleagues measured the achievement, power, and affiliation motives of random samples of American adults in 1957 and again in 1976.³⁰ They did this by showing the respondents six pictures that had been selected to yield the best distributions of scores on affiliation, power, and achievement imagery. They asked the respondents to tell stories about the pictures (note the projective device) and then content analyzed (coded) the responses,
²⁸P. Suedfeld and A. Rank, "Revolutionary Leaders: Long-Term Success as a Function of Changes in Conceptual Complexity," Journal of Personality and Social Psychology, 34 (1976), 169-178.
²⁹H. Schroder, M. Driver, and S. Streufert, Human Information Processing. New York: Holt, Rinehart and Winston, 1967. The system for assessing conceptual complexity is explained in detail in Appendix 2 of this book. An example here, however, may help. Using a seven-point scale, a paragraph is scored 6 (high integration or complexity) if it indicates the simultaneous operation of alternatives and consideration of functional relations between alternatives (p. 189). In contrast, a paragraph (or subunit thereof) is scored 1 (low complexity) if it is generated by a fixed rule and no alternatives are expressed. Here is an example of a 1 response: "Rules are made to be followed. They should not be broken. . . ." (p. 190).
³⁰J. Veroff, C. Depner, R. Kulka, and E. Douvan, "Comparison of American Motives: 1957 versus 1976," Journal of Personality and Social Psychology, 39 (1980), 1249-1262. The authors did not use the expression "content analysis" in their report; they used the term "coding," which is the procedure used by the Survey Research Center, University of Michigan (and elsewhere) to "score" the results of interviews. (See discussion in chap. 24, above.)
which were taken down verbatim by the interviewers. They thus had motive scores for each member of the 1957 sample and for each member of the 1976 sample.
Their analysis of the results suggested that disturbing changes had taken place in American men. There was a decrease in men's affiliation motivation and no change in their achievement motivation.³¹ Women were more achievement-motivated in 1976 than they were in 1957, a finding to be expected in the light of the greatly changed conditions due to the women's liberation movement. Both men's and women's fear-of-power motivation increased. Veroff et al. say that this reflects increased fear of being controlled by others, a need for autonomy. This study is notable because it successfully used a projective method and content analysis to measure the motives of large random samples of the whole country.
The first step, as usual, is to define U, the universe of content that is to be analyzed. In Veroff et al.'s study, U was all verbal replies to the questions asked the respondents. Categorization, or the partitioning of U, is perhaps the most important part of the analysis because it is a reflection of the theory or hypotheses being tested. It spells out, in effect, the variables of the hypotheses. Veroff et al.'s main categorization seems to have been: achievement, power, affiliation; that is, verbal utterances could be categorized under one or the other subhead. A great deal of thought, work, and care must go into these first steps.
Units of Analysis
Berelson lists five major units of analysis: words, themes, characters, items, and space-and-time measures.³² The word is the smallest unit. (There can even be smaller units: letters, phonemes, etc.) It is also an easy unit to work with, especially in computer content analysis. An investigator may be studying value words in the writings of high school students. For some reason, he may wish to know the relation between sex or political preference of parents, on the one hand, and the use of value words, on the other hand. The word unit may also be useful in studies of reading. U can ordinarily be clearly defined and
³¹The reader who wishes to know more about these motives and their conceptualization and measurement should consult: J. Atkinson, ed., Motives in Fantasy, Action, and Society. Princeton, N.J.: Van Nostrand, 1958.
³²Berelson, op. cit., pp. 508-509.
categorized; for example, value words and nonvalue words, or difficult, medium, and easy words. Then the words can simply be counted and assigned to appropriate categories.³³
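Counting of this kind is easily mechanized. A minimal sketch in Python; the short list of value words is made up for illustration (a real study would derive its list from theory):

import re
from collections import Counter

VALUE_WORDS = {"honest", "fair", "loyal", "good", "right", "duty"}   # hypothetical list

text = "It is only fair to be honest, and a loyal friend does what is right."

words = re.findall(r"[a-z']+", text.lower())     # crude tokenization
counts = Counter("value" if w in VALUE_WORDS else "nonvalue" for w in words)

print(counts)    # Counter({'nonvalue': 11, 'value': 4})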
The theme is a useful though more difficult unit. A theme is often a sentence, a proposition about something. Themes are combined into sets of themes. The letters of adolescents or college students may be studied for statements of self-reference. This would be the larger theme. The themes making this up might be defined as any sentences that use "I," "me," and other words indicating reference to the writer's self. Discipline is an interesting larger theme. Child training or control is another. Many observers take field notes in a thematic manner. Here is an example from the field notes of an observer in a small village in Japan:
    Food-Training: . . . Use cajolery in this family, but it varies. Parents will leave something disliked (if child should be obstreperous about it) out of child's diet.³⁴
    Informant's first child was fed whenever he cried, and the second child was started out that way; but after one month, informant fed second child on schedule. . . .³⁵
It should be emphasized, as Berelson does, that if the themes are complex, content
analysis using the theme as the unit of analysis is difficult and perhaps unreliable. Yet it
is an important and useful unit because it is ordinarily realistic and close to the original
content.
Character and space-and-time measures are probably not too useful in behavioral
research. The first is simply an individual in a literary production. We might use it in
analyzing stories. The second is the actual physical measurement of content: number of
inches of space, number of pages, number of paragraphs, number of minutes of discus-
sion, and so on.
Like the theme, the item unit is important. It is a whole production: an essay, news
story, television program, class recitation or discussion. Getzels and Jackson used short autobiographies to measure creativity.³⁶ The unit was the item, the whole autobiography.
Each autobiography was judged either "creative" or "noncreative." Children can be
asked to write projective stories in response to a picture. The whole story of each child can
be the unit of analysis. Judges can be trained to use a rating scale to assess the creativity of
the stories. Or judges can be trained to assign each story to a creative or noncreative
category.
It is likely that the item as a unit of analysis will be particularly useful in behavioral
research. As long as pertinent criteria for categorizing a variable can be defined, and as
³³. . . execution. One instructs the computer to determine the frequencies of the use of certain words in texts of known and disputed authorship. Appropriate statistical comparison of known and disputed texts should then settle the authorship matter. The content analysis is simple; the comparisons and judgments may be complex and difficult, especially if the authors of known and disputed texts espouse similar values, or have similar writing styles, as Hamilton and Madison did.
³⁴Center for Japanese Studies, University of Michigan, Okayama Field Station, 863, Niiike, August 25, 1950 (GB). ("863" is the Yale field number; "Niiike" is the Japanese village; "GB" is the observer.)
³⁵CFJS, UM, OFS, 853, Takashima, Dec. 18, 1950 (MFN).
³⁶Getzels and Jackson, op. cit., pp. 99-103.
long as judges can agree substantially in their ratings, rankings, or assignments, then the item unit is profitable to use. But careful checks on reliability and validity must be made. Judges can wander from the criteria, and they can lose themselves in the masses of reading they must do. Yet it is surprising how much agreement can be reached, even for rather complex material. In judging the creativity of student essays, teacher judges in Hartsdale, New York, achieved agreement coefficients of .70 and .80. Here, again, whole essays were the units.
Quantification
All materials are potentially quantifiable. We can even rank the sonnets of Shakespeare or
the last five piano sonatas of Beethoven in the order of our personal preferences, if
nothing else. It is true that some materials are not as amenable to quantification as certain
other materials. After all, it is much easier to assign numbers to spelling performance than
a large number of essays are involved, they can still be ranked, but a more manageable
system than total ranking can be used, for example, 10 or 1 1 ranks can be made available,
and judges can assign the ranks to the essays.
A third form of quantification is rating. Children's compositions, for example, can be
rated as wholes for degrees of creativity, originality, achievement orientation, interests,
values, and other variables.
Certain conditions have to be met before quantification is justified or worthwhile. Berelson has spelled out these conditions.³⁸ Two of his seven conditions should be noted: (1) to count carefully (or otherwise quantify) when the materials to be analyzed are representative, and (2) to count carefully when the category items appear in the materials in sufficient numbers to justify counting (or otherwise quantifying). The reason for both conditions is obvious: if the materials are not representative or if the category items are too infrequent, counting can be misleading. One can, of course, have materials produced so that quantification is possible. If materials cannot meet the criteria, they can be used only for heuristic and suggestive purposes and not for relating variables.
Computer content analysis certainly has not yet reached the sort of stability necessary for routine instruction. To give some idea of how the computer is used, and to encourage exploration of possibilities, we briefly describe the computer system known as "The General Inquirer."³⁹ Then we outline Page's system of grading essays with the computer to illustrate a functioning system.
The General Inquirer is a set of computer programs geared to the content and statistical analysis of verbal materials, so generalized that it can be applied to a variety of research problems. Its aim is to free researchers from the details of computer operations and yet to enable them to use the computer flexibly. It locates, counts, tabulates, and analyzes the characteristics of "natural text."
The basis of the system is the "dictionary," which is a large set of words (or short phrases), each word being defined by "tags" or categories. For example, pronouns like I, me, and mine are tagged "self," and army, church, and administration are tagged "large-group."⁴⁰ These are called first-order tags; they represent the common or manifest meaning of the dictionary words.⁴¹ Second-order tags represent the connotative meanings of words: status connotations and institutional contexts, for example. Holsti gives the example of a dictionary word "teacher," which is tagged with three meanings: job-role, higher-status, academic. The first is a first-order tag, the second and third second-order tags.⁴²
Special-purpose dictionaries for a particular research problem are stored in the computer. For example, one such dictionary, used to analyze verbal materials for need achievement, consists of about 800 entries with 14 tags (categories).⁴³ The rules for scoring materials for achievement imagery developed by McClelland and his colleagues⁴⁴ are behind the dictionary and computer program. Say a child has written a story. The whole story is punched on cards and fed into the computer. The computer scans the words in the story, tags those relevant to need achievement, counts them, and does other analyses. In this case the dictionary consists of words like want, hope, become, gain, compete. The words want and hope are tagged "need"; words like fame and honor are tagged "success."
The computer needs rules that it can use to identify tag sequences that indicate need for achievement. A simple example is when the tags "need" and "compete" appear in
³⁹P. Stone et al., The General Inquirer: A Computer Approach to Content Analysis. Cambridge, Mass.: M.I.T. Press, 1966, especially chap. 3. See, too, Holsti, op. cit., pp. 665ff.
⁴⁰Ibid., p. 666.
⁴¹Stone et al., op. cit., p. 174.
⁴²Holsti, op. cit., p. 666.
⁴³Stone et al., op. cit., pp. 191-206.
⁴⁴McClelland et al., op. cit.
one sentence.⁴⁵ The computer prints out the analysis and, on the basis of rules built into it, arrives at an overall assessment that a passage contains achievement imagery or does not contain it.
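The flavor of this dictionary-and-rule procedure can be suggested in a few lines of Python. This is a sketch of the idea only, not the General Inquirer itself; the tiny dictionary, the crude word matching, and the sample story are invented for illustration:

import re

# Toy tag dictionary in the spirit of the need-achievement dictionary described above.
TAGS = {"want": "need", "wants": "need", "hope": "need",
        "compete": "compete", "gain": "compete",
        "fame": "success", "honor": "success"}

def has_achievement_imagery(sentence):
    # Apply the rule: the tags "need" and "compete" both appear in one sentence.
    words = re.findall(r"[a-z]+", sentence.lower())
    tags = {TAGS[w] for w in words if w in TAGS}   # a real system would also handle suffixes
    return "need" in tags and "compete" in tags

story = "He wants to gain first place. His dog is brown."
for sentence in story.split("."):
    if sentence.strip():
        print(sentence.strip(), "->", has_achievement_imagery(sentence))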
. . . system two or three times with actual problems. But to try to fill in a few of the gaps, let's look at Page's highly effective, and controversial, computer method of grading essays.⁴⁶ In one study, Page analyzed 276 essays written by students in grades 8 through 12. Four judges graded the essays for overall quality. These judgments were the criterion intrinsic measures, or "trins," as Page calls them. The computer was programmed to measure 30 approximation measures, or "proxes" (again, Page's expression), of various characteristics of the essays: average sentence length, length of essay in words, use of uncommon words, number of commas, and so on. Each of these measures was used in an equation to predict the human ratings of the essays provided by the four judges. What amounts to a coefficient of correlation between a weighted average of the 30 measures and the human judgments was a remarkable .71! Subsequent work confirmed the ability of the method.
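Page's procedure is, at bottom, multiple regression: the proxes are the predictors and the pooled human judgments (the trins) are the criterion. A minimal sketch with numpy, using made-up data and only three of the thirty proxes:

import numpy as np

rng = np.random.default_rng(0)
n_essays = 50

# Made-up proxes: average sentence length, essay length in words, number of commas.
X = rng.normal(size=(n_essays, 3))
# Made-up trins: pooled judges' quality ratings, partly predictable from the proxes.
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(scale=0.5, size=n_essays)

X1 = np.column_stack([np.ones(n_essays), X])     # add an intercept column
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)    # least-squares weights for the proxes

predicted = X1 @ beta
multiple_r = np.corrcoef(predicted, y)[0, 1]     # analogue of Page's .71
print(np.round(beta, 2), round(float(multiple_r), 2))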
A Cautionary Note
Available materials, projective devices, and content analysis should not be used indis-
criminately. It should be obvious that they are not easy to use and that, in most cases,
there are better and easier ways to measure variables. If it is possible to use an attitude
scale, then trying to measure attitudes with a projective device or the content analysis of
available materials seems pointless and wasteful. There are research situations, however,
in which certain variables are difficult or impossible to measure. One thinks of trying to
measure ethnocentric attitudes in a veterans association. An attitude scale would probably
be rejected, but an interview in which opportunities are given respondents to expand on
their beliefs about their own groups and other groups might be possible. The results of the
interviews can then be content analyzed for expressions of ethnocentrism.
Study Suggestions
1. The following references are useful beginnings in the use of available materials and content
analysis.
⁴⁵Stone et al., p. 201.
⁴⁶E. Page, "Grading Essays by Computer: Progress Report." In Invitational Conference on Testing Problems, 1966. Princeton, N.J.: Educational Testing Service, 1966, pp. 87-100.
Berelson and Holsti chapters (footnote 26). These chapters are authoritative.
Stone book, chaps. 1 and 2 (footnote 39).
Angell, R., and Freedman, R. "The Use of Documents, Records, Census Materials, and Indices." In L. Festinger and D. Katz, eds., Research Methods in the Behavioral Sciences. New York: Holt, Rinehart and Winston, 1953, chap. 7. One of the only references on available materials.
Holsti, O. Content Analysis for the Social Sciences and Humanities. Reading, Mass.: Addison-Wesley, 1969. A competent text.
Markoff, J., Shapiro, G., and Weitman, S. (footnote 26). Very good general chapter that emphasizes the use of content analysis to measure variables. Also has a good discussion of The General Inquirer.
2. Read one or two of the following studies. They show what can be done with knowledge and ingenuity.
Alper, T., Blane, H., and Abrams, B. "Reactions of Middle and Lower Class Children to Finger Paints as a Function of Class Differences in Child-Training Practices." Journal of Abnormal and Social Psychology, 51 (1955), 439-448. A most ingenious and fruitful use of finger paints.
DeCharms, R., and Moeller, G. "Values Expressed in American Children's Readers: 1800-1950." Journal of Abnormal and Social Psychology, 64 (1962), 136-142. Study of the values and motives expressed in 150 years of children's readers. The authors tested, among other things, the engaging hypothesis that achievement motivation is related to inventiveness as expressed by number of patents issued.
Hiller, J., Marcotte, D., and Martin, T. "Opinionation, Vagueness, and Specificity-Distinction: Essay Traits Measured by Computer." American Educational Research Journal, 6 (1969), 271-286. Interesting use of computer analysis of essays; measured three categories of writing style: opinionation-exaggeration, vagueness, specificity-distinctions.
McClelland book (footnote 11). Describes the theoretical and practical development of the famous need-for-achievement measure.
Skinner, B. "The Alliteration in Shakespeare's Sonnets: A Study in Literary Behavior." Psychological Record, 3 (1939), 186-192. Used probability theory and content analysis to study Shakespeare's use of alliteration. Among other things, found that the line contained, on the average, 5.036 syllables, very close to pentameter!
Tetlock, P. (footnote 3). Interesting but technically difficult.
Tetlock, P. "Pre- to Postelection Shifts in Presidential Rhetoric: Impression Management or Cognitive Adjustment?" Journal of Personality and Social Psychology, 41 (1981), 207-212. Used content analysis of presidential policy statements before and after election to measure conceptual complexity.
Whiting, J., and Child, I. Child Training and Personality. New Haven: Yale University Press, 1953. Theoretically oriented content analysis study that fruitfully used the Human Relations Area Files (see footnote 6) to test psychoanalytic hypotheses.
Winter, D., Alpert, R., and McClelland, D. "The Classic Personal Style." Journal of Abnormal and Social Psychology, 67 (1963), 254-265. Content analyzed stories to detect basic "themes" of a private secondary school and imaginative processes of its students. Fascinating.
Winter, D., and McClelland, D. "Thematic Analysis: An Empirically Derived Measure of the Effects of Liberal Arts Education." Journal of Educational Psychology, 70 (1978), 8-16. A modern and imaginative study of an old educational problem.
3. A useful class project would be for a class committee to find sources of available data. One huge source of educational data, for instance, is the data bank of Project Talent, with its more than 500 variables and 400,000 secondary school students.⁴⁷ The National Center for Educational Statistics (U.S. Dept. of Education) releases data on more than 50,000 upper high-school students. For example, in June 1982, the data tape and codebook for the Twin and Sibling File, which is part of High School and Beyond, a national longitudinal study, was released. Another important source of cultural data is the Human Relations Area Files (footnote 6). (Note: A good source for this project is Schoenfeldt's article on data archives, which gives the names, addresses, and brief descriptions of major data archives in the United States and certain other countries; see footnote 8.)
⁴⁷See J. Flanagan et al., Project Talent: Five Years After High School. Palo Alto: American Institutes for Research and University of Pittsburgh, 1971.
Chapter 31
Observations of Behavior and Sociometry
Everyone observes the actions of others. We look at other persons and listen to them talk. We infer what others mean when they say something, and we infer the characteristics, motivations, feelings, and intentions of others on the basis of these observations. We say, "He is a shrewd judge of people," meaning that his observations of behavior are keen and that we think his inferences of what lies behind the behavior are valid. This day-to-day kind of observation of most people, however, is unsatisfactory for science. Social scientists must also observe human behavior, but they are dissatisfied with uncontrolled observations. They seek reliable and objective observations from which they can draw valid inferences. They treat the observation of behavior as part of a measurement procedure: they assign numerals to objects, in this case human behavioral acts or sequences of acts, according to rules.
This seems simple and straightforward. Yet evidently it is not: there is much controversy and debate about observation and methods of observation. Critics of the point of view that observations of behavior must be rigorously controlled (the point of view espoused in this chapter and elsewhere in this book) claim that it is too narrow and artificial. Instead, say the critics, observations must be naturalistic: observers must be immersed in ongoing, realistic, and natural situations and must observe behavior as it occurs in the raw, so to speak. As we will see, however, observation of behavior is extremely complex and difficult.
Basically, there are two modes of observation: we can watch people do and say things
and we can ask people about their own actions and the behavior of others. The principal
ways of getting information are either by experiencing something directly, or by having
someone tell us what happened. In this chapter we are concerned mainly with seeing and
hearing events and observing behavior, and solving the scientific problems that spring from such observation. We also examine, if briefly, a method for assessing the interactions and interrelations of group members: sociometry. Sociometry is a special and valuable form of observation: group members, who of course observe each other, record their reactions to each other so that researchers can assess the sociometric status of groups.
The Observer
The major problem of behavioral observation is the observer himself. One of the difficulties with the interview, recall, is the interviewer, because he is part of the measuring instrument. This problem is almost nonexistent in objective tests and scales. In behavioral observation the observer is both a crucial strength and a crucial weakness. Why? The observer must digest the information derived from observations and then make inferences about constructs. He observes a certain behavior, say a child striking another child. Somehow he must process this observation and make an inference that the behavior is a manifestation of the construct "aggression" or "aggressive behavior," or even "hostility." The strength and the weakness of the procedure is the observer's powers of inference. If it were not for inference, a machine observer would be better than a human observer. The strength is that the observer can relate the observed behavior to the constructs or variables of a study; he brings behavior and construct together. One of the recurring difficulties of measurement is to bridge the gap between behavior and construct.¹
The basic weakness of the observer is that he can make quite incorrect inferences from
observations. Take two extreme cases. Suppose, on the one hand, that an observer who is
strongly hostile to parochial school education observes parochial school classes. It is clear
that his bias may well invalidate the observation. He can easily rate an adaptable teacher
as somewhat inflexible because he perceives parochial school teaching as inflexible. Or he
may judge the actually stimulating behavior of a parochial school teacher to be dull. On
the other hand, assume that an observer can be completely objective and that he knows
nothing whatever about public or parochial education. In a sense any observations he
makes will not be biased, but they will be inadequate. Observation of human behavior
requires competent knowledge of that behavior, and even of the meaning of the behavior.
There is, however, another problem: the observer can affect the objects of observation simply by being part of the observational situation. Actually and fortunately, this is not a severe problem. Indeed, it is more of a problem to the uninitiated, who seem to believe that people will act differently, even artificially, when observed. Observers seem to have little effect on the situations they observe.² Individuals and groups seem to adapt rather
¹In their excellent chapter on observation in classrooms, Medley and Mitzel say that the observer should use the least judgment possible: only a judgment "needed to perceive whether the behavior has occurred or not." This means, of course, the least inference possible. While their argument is well taken, it is perhaps too strong. D. Medley and H. Mitzel, "Measuring Classroom Behavior by Systematic Observation." In N. Gage, ed., Handbook of Research in Teaching. Skokie, Ill.: Rand McNally, 1963, chap. 6 (see pp. 252-253). K. Weick, "Systematic Observational Methods." In G. Lindzey and E. Aronson, eds., The Handbook of Social Psychology, 2d ed. Reading, Mass.: Addison-Wesley, 1968, vol. II, chap. 13, p. 359, supports Medley and Mitzel's view. At the same time, he argues for a more active role of the observer, as Cannell and Kahn argued for a more active role of the interviewer (see Chapter 28).
²R. Heyns and R. Lippitt, "Systematic Observational Techniques." In G. Lindzey, ed., Handbook of Social Psychology. Cambridge, Mass.: Addison-Wesley, 1954, vol. I, p. 399; Weick, op. cit., pp. 369ff.
quickly to an observer's presence and to act as they would usually act. This does not mean
that the observer cannot have an effect. It means that if the observer takes care to be
unobtrusive and not to give the people observed the feeling that judgments are being
made, then the observer as an influential stimulus is mostly nullified.
On the surface, nothing seems more natural when observing behavior than to believe that
we are measuring what we say we are measuring. When an interpretative burden is put on
the observer, however, validity may suffer (as well as reliability). The greater the burden
of interpretation, the greater the validity problem. (This does not mean, however, that no
burden of interpretation should be put on the observer.)
A simple aspect of the validity of observation measures is their predictive power. Do
they predict relevant criteria dependably? The trouble, as usual, is in the criteria. Indepen-
dent measures of the same variables are rare. Can we say that an observational measure of
teacher behavior is valid because it correlates positively with superiors' ratings? We might
have an independent measure of self-oriented needs, but would this measure be an ade-
quate criterion for observations of such needs?
An important clue to the study of the validity of behavioral observation measures would seem to be construct validity. If the variables being measured by an observational procedure are embedded in a theoretical framework, then certain relations should exist. Do they indeed exist? Suppose our research involves Bandura's self-efficacy theory³ and that we have constructed an observation system whose purpose is to measure performance competence. The theory says, in effect, that perceived self-efficacy, or the self-perception of competence, affects the competence of a person's actual performance: the higher one's self-efficacy, the higher the performance competence.⁴ If we find that the correlation between self-perception of competence and measures of actual observed competence in doing certain prescribed tasks is positive and substantial, then the hypothesis derived from the theory is supported. But . . .
. . . observer agreement is subject to chance agreement and thus needs correction. Perhaps the safest course to follow is to use different methods of assessing reliability, just as we would with any measures used in behavioral research: agreement of observers, repeat reliability, and the analysis of variance approach.⁵
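The text does not name a particular correction, but Cohen's kappa is a standard chance-corrected agreement index for nominal categories and will serve as an illustration. A minimal sketch in Python; the two observers' codes are hypothetical:

from collections import Counter

# Hypothetical codes assigned by two observers to the same twelve behavioral acts.
obs_1 = ["agg", "coop", "agg", "neut", "coop", "coop", "agg", "neut", "coop", "agg", "neut", "coop"]
obs_2 = ["agg", "coop", "neut", "neut", "coop", "agg", "agg", "neut", "coop", "agg", "neut", "coop"]

n = len(obs_1)
p_observed = sum(a == b for a, b in zip(obs_1, obs_2)) / n   # raw proportion of agreement

# Agreement expected by chance, from each observer's marginal proportions.
c1, c2 = Counter(obs_1), Counter(obs_2)
p_chance = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(p_observed, 2), round(kappa, 2))     # 0.83 and 0.75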
³A. Bandura, "Self-Efficacy Mechanism in Human Agency," American Psychologist, 37 (1982), 122-147.
⁴Ibid., p. 122.
⁵Medley and Mitzel, op. cit., pp. 309ff., give a thorough exposition of the reliability of ratings in an analysis of variance framework. But their discussion is difficult, requiring considerable statistical background. (See chap. 26, above.) Hollenbeck discusses reliability of observations when measures are nominal: A. Hollenbeck, "Problems of Reliability in Observational Research." In G. Sackett, ed., Observing Behavior: Vol. 2. Data Collection and Analysis Methods. Baltimore: University Park Press, 1978, chap. 5 (pp. 79-98). See, also, Weick, op. cit., pp. 403-406, and G. Rowley, "The Reliability of Observational Measures," American Educational Research Journal, 13 (1976), 51-59. Assessing reliability and agreement among observers are especially difficult problems of direct observation because the usual statistics depend on the assumption that measures are independent, and they are often not independent. It is likely that approaches to these problems will change radically in the coming decade with the rapid development of multivariate methods and time-series analysis and the availability of computer programs to expedite both recording and analysis of observational data. Such programs should become available for microcomputers and consequently be convenient and cheap (though perhaps not easy to use).
. . . ative behavior is distinguished from other kinds of behavior. This means that we must provide the observer with some form of operational definition of the variable being measured; we must define the variable behaviorally.
Categories
The fundamental task of the observer is to assign behaviors to categories. From our earlier work on partitioning, recall that categories must be exhaustive and mutually exclusive. To satisfy the exhaustiveness condition, one must first define U, the universe of behaviors to be observed. In some observation systems, this is not hard to do. McGee and Snyder, testing the hypothesis that people who salt their food before they taste it perceive control of behavior as being more from within the individual (dispositional control) than from the environment (situational control), simply observed subjects' salting of food in restaurants.⁶ In other observation systems it is more difficult. Many or most of the observation systems cited in the huge anthology of observation instruments, Mirrors for Behavior, are complex and hardly easy to use.⁷
In keeping with the emphasis of this book that the purpose of most observation is to measure variables, we cite a classroom observation system from the highly interesting, even creative, work of Kounin and his colleagues.⁸ The system reported is more complex than the salt-tasting observation system but much less complex than many classroom observation systems. The variable observed was task-involvement, which was observed by videotaping 596 lessons and then observing playbacks of the tapes to obtain the involvement measures. These measures were categorized as high task-involvement and low task-involvement. The authors also measured continuity in lessons by creating categories that reflected greater or lesser continuity in lessons. When the children's behavior was observed, they used the categories to record the pertinent observed behaviors.
Units of Behavior
What units to use in measuring human behavior is still an unsettled problem. Here one is often faced with a conflict between reliability and validity demands. Theoretically, one can attain a high degree of reliability by using small and easily observed and recorded units. One can attempt to define behavior quite operationally by listing a large number of behavioral acts, and can thus ordinarily attain a high degree of precision and reliability. Yet in so doing one may also have so reduced the behavior that it no longer bears much resemblance to the behavior one intended to observe. Thus validity may be lost.
On the other hand, one can use broad "natural" definitions and perhaps achieve a high degree of validity. One might instruct observers to observe cooperation and define it for them only in broad terms. It might be
⁶M. McGee and M. Snyder, "Attribution and Behavior: Two Field Studies," Journal of Personality and Social Psychology, 32 (1975), 185-190. The correlation between food salting and control attribution was .71.
⁷A. Simon and E. Boyer, Mirrors for Behavior. Philadelphia: Research for Better Schools, 1970. This work has 14 volumes of behavior observation instruments! Most of these, 67 of 79, are for educational observations. Readers who intend to use behavior observation in their research should consult these volumes, especially . . .
⁸J. Kounin and P. Doyle, "Degree of Continuity of a Lesson's Signal System and the Task Involvement of Children," Journal of Educational Psychology, 67 (1975), 159-164; J. Kounin and P. Gump, "Signal Systems . . ."
expected that they could validly assess behavior as cooperative and uncooperative by
using this definition. Such a broad, even vague, definition enables the observer to capture,
if he can, the full flavor of cooperative behavior. But its considerable ambiguity allows
differences of interpretation, thus probably lowering reliability.
Some researchers who are strongly operational in their approach insist upon highly
specific definitions of the variables observed. They may list a number of specific behav-
iors for the observer to observe. No others would be observed and recorded. Extreme
approaches like this may produce high reliability, but they may also miss part of the
essential core of the variables observed. Suppose ten specific types of behavior are listed
for cooperativeness. Suppose, too, that the universe of possible behaviors consists of 40
or 50 types. Clearly, important aspects of cooperativeness will be neglected. While what
is measured may be reliably measured, it may be quite trivial or partly irrelevant to the
variable cooperativeness.
This is the molar-molecular problem of any measurement procedure in the social sciences. The molar approach takes larger behavioral wholes as units of observation. Complete interaction units may be specified as observational targets. Verbal behavior may be broken down into complete interchanges between two or more individuals, or into whole paragraphs or sentences. The molecular approach, by contrast, takes smaller segments of behavior as units of observation. Each interchange or partial interchange may be recorded. Units of verbal behavior may be words or short phrases. Molar observers start with a general broadly defined variable, as given earlier, and observe and record a variety of behaviors under the one rubric. They depend on experience and knowledge to interpret the meaning of the behavior they observe. Molecular observers, on the other hand, seek to push their own experience, knowledge, and interpretation out of the observational picture. They record what they see, and no more.
Observer Inference
Observation systems differ on another important dimension: the amount of inference required of the observer. Molecular systems require relatively little inference. The observer simply notes that an individual does or says something. For example, a system may require the observer to note each interaction unit, which may be defined as any verbal interchange between two individuals. If an interchange occurs, it is noted; if it does not occur, it is not noted. Or a category may be "Strikes another child." Every time one child strikes another it is noted. No inferences are made in such systems, if, of course, it is ever possible to escape inferences (for example, "strikes"). Pure behavior is recorded as nearly as possible.
Observer systems with such low degrees of observer inference are rare. Most systems
require some degree of inference. An investigator may be doing research on board of
education behavior and may decide that a low inference analysis is suited to the problem
and use observation items like "Suggests a course of action," "Interrupts another board
member," "Asks a question," "Gives an order to superintendent," and the like. Since
such items are comparatively unambiguous, the reliability of the observation system
should be substantial.
Systems with higher degrees of inference required of the observer are more common and probably more useful in most research. The high inference observation system gives the observer labeled categories that require greater or lesser interpretation of the observed behavior. A category like "dominance," for example, might be defined as attempts by an individual to show intellectual (or other) superiority over other individuals, with little recognition of group goals and the contributions of others. This will of course require a high degree of observer inference, and observers will have to be trained so that there is agreement on what constitutes dominant behavior. Without such training and agreement, and probably observer expertise in group processes, reliability can be endangered.⁹ Similar remarks are pertinent when we try to measure many psychological and sociological variables: cooperation, competition, aggressiveness, democracy, verbal aptitude, achievement, and social class, for example.
It is not possible to make flat generalizations about the relative virtues of systems with different degrees of inference. Probably the best advice to the neophyte is to aim at a medium degree of inference. Categories that are too vague, with too little specification of what to observe, put an excessive burden on the observer. Different observers can too easily put different interpretations on the same behavior. Categories that are too specific, while they cut down ambiguity and uncertainty, may tend to be too rigid and inflexible, even trivial. Better than anything else, the student should study various successful systems, paying special attention to the behavior categories and the definitions (instructions) attached to the categories for the guidance of the observer.
Generality or Applicability
Some observation systems are general: they are applicable to many research problems and situations. In Bales's Interaction Process Analysis, for example, any act of an individual in a group can be categorized into one of twelve categories: "shows solidarity," "agrees," "asks for opinion," and so on.¹⁰ The twelve categories are grouped in three larger sets: social-emotional-positive; social-emotional-negative; task-neutral.
Some systems, however, were constructed for particular research situations to measure particular variables. The food-salting example, above, is quite specific, hardly applicable in other situations. The Kounin and Doyle system, while specifically constructed for Kounin's research, can be applied in many classroom situations. Indeed, most systems devised for specific research problems can probably be used, often with modification, for other research problems.
I want to emphasize that "small" observation systems can be used to measure specific variables. Suppose, for instance, that the attentiveness of elementary school pupils is a key variable in a theory of school achievement. Attentiveness (as a trait or habit) in and of itself has little effect on achievement: let's say the correlation is zero. But, the theory claims, it is a key variable because, with a certain method of teaching, it interacts with the method and has a pronounced indirect effect on achievement. Assuming that this is so, we must measure attentiveness. It seems clear that we will have to observe pupil behavior while the method in question and a "control" method are used. In such a case, we will have to find or devise an observation system that focuses on attentiveness. In assessing the influence of classroom environment, for example, Keeves found it necessary to measure
⁹. . . 428-432 for discussion of biases in observation, and suggested methodological solutions for minimizing the effects of bias. Another valuable source is Heyns and Lippitt, op. cit., pp. 396-403.
¹⁰R. Bales, Interaction Process Analysis. Reading, Mass.: Addison-Wesley, 1951.
attentiveness.¹¹ He did this by observing students who were required to attend to tasks prescribed by the teacher. Scores that indicated attentiveness or the lack of it were assigned. This "small" observation system was reliable and apparently valid. It is likely that such targeted systems will be increasingly used in behavioral research, especially in education.
Sampling of Behavior
The last characteristic of observations, sampling, is, strictly speaking, not a characteristic. It is a way of obtaining observations. Before using an observation system in actual research, when and how the system will be applied must be decided. If classroom behaviors of teachers are to be observed, how will the behaviors be sampled? Will all the specific behaviors in one class period be observed, or will specified samples of specified behaviors be sampled systematically or randomly? In other words, a sampling plan of some kind must be devised and used.
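Such a plan can be drawn up, and documented, before observation begins. A minimal sketch in Python of one random plan; the class hours, the number of five-minute periods per hour, and the three periods drawn per hour are all assumed values for illustration:

import random

random.seed(7)   # fix the seed so the plan can be reproduced

class_hours = ["Mon-1", "Mon-3", "Mon-5", "Tue-2", "Tue-4"]   # assumed observation sessions
periods_per_hour = 10    # a 50-minute class hour holds ten 5-minute periods

# For each class hour, draw three of the ten five-minute periods at random.
plan = {hour: sorted(random.sample(range(1, periods_per_hour + 1), 3))
        for hour in class_hours}

for hour, periods in plan.items():
    print(hour, "-> observe 5-minute periods", periods)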
There are two aspects of behavior sampling: event sampling and time sampling.¹² Event sampling is the selection for observation of integral behavioral occurrences or events of a given class.¹³ Examples of integral events are temper tantrums, fights and quarrels, games, verbal interchanges on specific topics, classroom interactions between teachers and pupils, and so on. The investigator who is pursuing events must either know when the events are going to occur and be present when they occur, as with classroom events, or wait until they occur, as with quarrels.
Event sampling has three virtues. One, the events are natural lifelike situations and thus possess an inherent validity not ordinarily possessed by time samples. Two, an integral event possesses a continuity of behavior that the more piecemeal behavioral acts of time samples do not possess. If one observes a problem-solving situation from beginning to end, one is witnessing a natural and complete unit of individual and group behavior. By so doing, one achieves a whole and realistic larger unit of individual or social behavior. As we saw in an earlier chapter when field experiments and field studies were discussed, naturalistic situations have an impact and a closeness to psychological and social reality that experiments do not usually have.
A third virtue of event sampling inheres in an important characteristic of many behavioral events: they are sometimes infrequent and rare. For example, one may be interested in decisions made in administrative or legislative meetings. Or one may be interested in the ultimate step in problem solving. Teachers' disciplinary methods may be a variable.
¹¹J. Keeves, Educational Environment and Student Achievement. Melbourne: Australian Council for Educational Research, 1972, pp. 62-65. This study also used a larger, carefully conceived and constructed observation system: see pp. 89-100. The focus of the study, however, was on "process" variables: achievement press (e.g., completion of homework), work habits and order, affiliation and warmth in the classroom, and so on. An influential school of thought in educational research emphasizes the importance of climate. Most of the measurement of climate, however, is accomplished with questionnaires that measure climate by asking questions of students and only rarely by direct observation. For a good review see: C. Anderson, "The Search for School Climate: A Review of the Research," Review of Educational Research, 52 (1982), 368-420. Another strong influence has been family environment research. Marjoribanks outlines the background and origins of this research: K. Marjoribanks, Families and Their Learning Environment: An Empirical Analysis. London: Routledge & Kegan Paul, 1979, chap. 2.
¹²See H. Wright, "Observational Child Study." In P. Mussen, ed., Handbook of Research Methods in Child Development. New York: Wiley, 1960, chap. 3. For a review of time sampling, see R. Arrington, "Time Sampling in Studies of Social Behavior: A Critical Review of Techniques and Results with Research Suggestions," Psychological Bulletin, 40 (1943), 81-124.
¹³Wright, op. cit., p. 104.
Such events and many others are relatively infrequent. As such, they can easily be missed by time sampling; they therefore require event sampling.¹⁴
Time sampling is the selection of behavioral units for observation at different points in time. They can be selected in systematic or random ways to obtain samples of behaviors. A good example is teacher behavior. Suppose the relations between certain variables like teacher alertness, fairness, and initiative, on the one hand, and pupil initiative and cooperativeness, on the other hand, are studied. We may select random samples of teachers and then take time samples of their behavioral acts. These time samples can be systematic: three five-minute observations at specified times during each of, say, five class hours, the class hours being the first, third, and fifth periods one day and the second and fourth periods the next day. Or they can be random: five five-minute observation periods selected at random from a specified universe of five-minute periods. Obviously, there are many ways that time samples can be set up and selected. As usual, the way such samples are chosen, their length, and their number must be influenced by the research problem.¹⁵
Time samples have the important advantage of increasing the probability of obtaining
representative samples of behavior. This is true, however, only of behaviors that occur
fairly frequently. Behaviors that occur infrequently have a high probability of escaping the
sampling net, unless huge samples are drawn. Creative behavior, sympathetic behavior,
and hostile behavior, for example, may be quite infrequent. Still, time sampling is a
positive contribution to the scientific study of human behavior.
Time samples, as implied earlier, suffer from lack of continuity, lack of adequate context, and perhaps naturalness. This is particularly true when small units of time and behavior are used. Still, there is no reason why event sampling and time sampling cannot sometimes be combined. If one is studying classroom recitations, one can draw a random sample of the class periods of one teacher at different times and observe all recitations during the sampled periods, observing each recitation in its entirety.
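Such a sampling plan can be made quite concrete. The short sketch below, in Python, is a minimal illustration of drawing a random time sample of five-minute observation periods from a specified universe; the class-hour labels, slot counts, and variable names are my own illustrative assumptions, not part of the original discussion.

    import random

    # Universe: five class hours, each divided into twelve
    # 5-minute observation slots (an assumed layout).
    class_hours = ["Day1-P1", "Day1-P3", "Day1-P5", "Day2-P2", "Day2-P4"]
    universe = [(hour, slot) for hour in class_hours for slot in range(12)]

    # Random time sample: five 5-minute periods drawn at random
    # from the universe, as in the example in the text.
    random.seed(1)  # fixed seed only to make the illustration reproducible
    time_sample = random.sample(universe, k=5)
    print(sorted(time_sample))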
RATING SCALES¹⁶
To this point, we have been talking only about the observation of actual behavior. Observers look at and listen to the objects of regard directly. They sit in the classroom and observe teacher-pupil and pupil-pupil interactions. Or they may watch and listen to a group of children solving a problem behind a one-way screen. There is another class of behavioral observation, however, that needs to be mentioned. This type of observation relies on the observer's memory and judgment: behavior is summarized and rated rather than recorded as it occurs.
¹⁴If one takes the more active view of observation advocated by Weick (see footnote 1), however, one can arrange situations to ensure more frequent occurrence of rare events.
¹⁵In a fascinating study of leadership and the power of group influence with small children, Merei points out that time sampling would show only leaders giving orders and the group obeying, whereas prolonged observations would show the inner workings of ordering and obeying. F. Merei, "Group Leadership and Institutionalization," Human Relations, 2 (1949), 23-39.
¹⁶For an excellent discussion of rating scales, see J. Guilford, Psychometric Methods, 2d ed. New York: McGraw-Hill, 1954.
There are four or five types of rating scales. Two of these types were discussed in Chapter
29: check lists and forced-choice instruments. We consider now only three types and their
characteristics. These are the category rating scale, the numerical rating scale, and the
graphic rating scale. They are quite similar, differing mainly in details.
The category rating scale presents observers or judges with several categories from
which they pick the one that best characterizes the behavior or characteristic of the object
being rated. Suppose a teacher's classroom behavior is being rated. One of the character-
istics rated, say, is alertness. A category item might be:
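The original example item is not reproduced in this text; an item of this kind might look something like the following illustrative reconstruction:

    Alertness:  ( ) very alert   ( ) alert   ( ) not very alert   ( ) not alert at all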
A different form uses condensed descriptions. Such an item might look like this:
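Again, the original example is not reproduced; a condensed-description item might look roughly like this (an illustrative reconstruction):

    Alertness:  ( ) is quick to grasp and respond to classroom events
                ( ) usually responds adequately
                ( ) is slow to notice and respond to classroom events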
Numerical rating scales are perhaps the easiest to construct and use. They also yield numbers that can be directly used in statistical analysis. In addition, because the numbers may represent equal intervals in the mind of the observer, they may approach interval measurement.¹⁷ Any of the above category scales can be quickly and easily converted to numerical rating scales simply by affixing numbers before each of the categories. The numbers 3, 2, 1, 0, or 4, 3, 2, 1, 0, can be affixed to the alertness item above. A convenient method of numerical rating is to use the same numerical system, say 4, 3, 2, 1, 0, with each item. This is of course the system used in summated-rating attitude scales. In rating scales, it is probably better, however, to give both the verbal description and the numerals.
In graphic rating scales lines or bars are combined with descriptive phrases. The
alertness item, just discussed, could look like this in graphic form:
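The original graphic item is likewise not reproduced; a rough plain-text rendering of such an item, with a line broken into marked equal intervals, might be:

    Alertness:  not alert at all |____|____|____|____| very alert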
Such scales have many varieties: vertical segmented lines, continuous lines, unmarked lines, lines broken into marked equal intervals (as above), and others. These are probably the best of the usual forms of rating scales. They fix a continuum in the mind of the observer. They suggest equal intervals. They are clear and easy to understand and use. Guilford overpraises them a bit when he says, "The virtues of graphic rating scales are many; their faults are few," but his point is well taken.¹⁸
Ratings have two serious weaknesses, one of them extrinsic, the other intrinsic. The extrinsic defect is that they are seemingly so easy to construct and use that they are used carelessly and indiscriminately. The intrinsic defect is the halo effect: the tendency for the rating of one characteristic to influence the ratings of other characteristics. Halo is difficult to avoid. It seems to be particularly strong in traits that are not clearly defined, not easily observable, and that are morally important.²⁰
Two important sources of constant error are the error of severity and the error of leniency. The error of severity is a general tendency to rate all individuals too low on all characteristics. This is the tough marker: "Nobody gets an A in my classes." The error of leniency is the opposite general tendency to rate too high. This is the good fellow who loves everybody — and the love is reflected in the ratings.
An exasperating source of invalidity in ratings is the error of central tendency, the general tendency to avoid all extreme judgments and rate right down the middle of a rating
¹⁷Ibid., p. 268.
¹⁹Guilford's advice is invaluable: ibid., pp. 264-268 and 293-296.
²⁰Ibid., p. 279.
scale. It manifests itself particularly when raters are unfamiliar with the objects being
rated.
There are other less important types of error that will not be considered. More important is how to cope with the types listed above. This is a complex matter that cannot be discussed here. The reader is referred to Guilford's chapter in Psychometric Methods, where many devices for coping with error are discussed in detail.²¹
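One such device, the analysis of variance estimation of the reliability of ratings mentioned in footnote 21 below, can be sketched briefly. The following Python fragment is a minimal illustration of a Hoyt-type two-way decomposition with invented ratings; it is not a reproduction of Guilford's or Ebel's own procedures.

    import numpy as np

    # Rows = subjects rated, columns = judges; one characteristic.
    # The ratings are invented for illustration.
    ratings = np.array([[4, 5, 4],
                        [2, 3, 2],
                        [5, 5, 4],
                        [1, 2, 2],
                        [3, 4, 3]], dtype=float)
    n_subj, n_judges = ratings.shape

    grand = ratings.mean()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_subjects = n_judges * ((ratings.mean(axis=1) - grand) ** 2).sum()
    ss_judges = n_subj * ((ratings.mean(axis=0) - grand) ** 2).sum()
    ss_residual = ss_total - ss_subjects - ss_judges

    ms_subjects = ss_subjects / (n_subj - 1)
    ms_residual = ss_residual / ((n_subj - 1) * (n_judges - 1))

    # Hoyt-type reliability of the averaged ratings.
    print(round((ms_subjects - ms_residual) / ms_subjects, 2))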
Rating scales can and should be used in behavioral research. Their unwarranted, expedient, and unsophisticated use has been rightly condemned. But this should not mean general condemnation. They have virtues that make them valuable tools of scientific research: they require less time than other methods; they are generally interesting and easy for observers to use; they have a very wide range of application; they can be used with a large number of characteristics. It might be added that they can be used as adjuncts to other methods. That is, they can be used as instruments to aid behavioral observations, and they can be used in conjunction with other objective instruments, with interviews, and even with projective measures.
One of the most carefully developed systems of classroom observation is the Observation Schedule and Record (OScAR), which was designed to permit the recording of as many significant aspects as possible of what goes on in classrooms.²² The individual items were found to group themselves into three relatively independent and reliable dimensions: Emotional Climate, Verbal Emphasis, and Social Organization. That is, the items belonging to each of these dimensions, or factors, were combined and treated as variables. Two items from each of the first two dimensions, respectively, are: "Teacher demonstrates affection for pupil" (positive), "Pupil ignores teacher's question" (negative); "Pupil reads or studies at his seat," "Pupil (or teacher) uses supplementary reading matter." The third dimension reflects social grouping in classes — for example, a class broken up into two or more groups working independently. It can be seen that this is a low-inference system.
Medley and Mitzel say that the three dimensions represent what are probably obvious
differences between classes, but the OScAR fails to tap aspects of classroom behavior
²¹Systematic errors can be dealt with to some extent by statistical means. Guilford has worked out an ingenious method using analysis of variance. The basic idea is that variances due to subjects, judges, and characteristics are extracted from the total variance of ratings. The ratings are then corrected. An easier method when rating individuals on only one characteristic is two-way (correlated-groups) analysis of variance. Reliability can also be easily calculated. The use of analysis of variance to estimate reliability, as we learned earlier, was Hoyt's contribution. Ebel applied analysis of variance to the reliability of ratings. See Guilford, op. cit., pp. 280-288, 383, 395-397; R. Ebel, "Estimation of the Reliability of Ratings," Psychometrika, 16 (1951), 407-424.
²²The system, with research data, is described in Medley and Mitzel, op. cit., pp. 278-286.
related to achievement of cognitive objectives. They are probably too harsh on their own system. The three dimensions of OScAR are important.
In one of the relatively few — and better — studies of college teachers and teaching, Isaacson and his colleagues, after considerable preliminary work on items and their dimensions or factors, had college students rate and evaluate their teachers based on their remembered observations and impressions.²⁴ They used a 46-item rating scale and instructed the students to respond according to the frequency of the occurrence of certain behavioral acts and not according to whether the behaviors were desirable or undesirable. Their basic interest was in the dimensions or underlying variables behind the items. They found six such dimensions, of which the first they thought to be related to general teaching skill. The six factors are important because they seem to show various aspects of teaching: for example, Structure, which is the instructor's organization of the course and its activities, and Rapport, which covers the more interactive aspects of teaching and friendliness.
²³G. Morrison, S. Forness, and D. MacMillan, "Influences on the Sociometric Ratings of Mildly Handicapped Children: A Path Analysis," Journal of Educational Psychology, 75 (1983), 63-74.
²⁴R. Isaacson et al., "Dimensions of Student Evaluations of Teaching," Journal of Educational Psychology, 55 (1964), 344-351. A number of similar studies have been published since this study appeared, but it is still one of the best, I think. Note that here we have an observation system that was not devised deliberately to measure variables but rather to help evaluate teaching performance. Nevertheless, its two basic dimensions can of course be used as variables in research. A remarkable aspect of studies to evaluate college teachers is that researchers seem not to be aware that the purpose of such observation systems should be the improvement of instruction (or to use their dimensions as research variables) and not for administrative purposes. See F. Kerlinger, "Student Evaluation of University Professors," School and Society, 99 (1971), 353-356.
While we may question calling this study and others like it observation studies, there
is certainly observation, though it is quite different in being remembered and indirect,
global and highly inferential, and, finally, much less systematic in actual observation. We
ask students to remember and rate behaviors that they may not have paid particular
attention to. Nevertheless, the Isaacson et al. and other studies have shown that this form
of observation can be reliably used in instructor and course evaluation.
There are a number of important observation systems devised to study group interaction. The best-known is the Bales system mentioned earlier. Borgatta, too, devised a system, called Behavior System Scores (BSs), that is virtually an interaction analysis system.²⁵ It has the virtues of being brief, fairly simple, and based on factor analysis. Its six categories apparently measure two basic dimensions: Assertiveness and Sociability. Examples of the categories of behavior in each of the factors are: assertions or dominant acts (draws attention, asserts, initiates conversation, etc.) and supportive acts (acknowledges, responds, etc.). Such a system may be useful in behavioral research whose focus is group interaction and behavior — decision-making groups, for example.
Direct observation can tell us what people actually do when confronted with different circumstances and different people. Moreover, in much, perhaps most, behavioral research, it is probably not necessary to use the larger observation systems. As shown earlier, smaller systems can be devised for special research purposes. Keeves' limited system was highly appropriate for his purpose. In any case, scientific behavioral research requires direct and indirect observations of behavior, and the technical means of making such observations are becoming increasingly adequate
²⁵E. Borgatta, "A New Systematic Observation System: Behavior Scores System (BSs System)," Journal of Psychological Studies, 14 (1963), 24-44. Also described briefly by Weick, op. cit., pp. 400-401.
and available. The next decade should see considerable understanding and improvement
of methods of observation, as well as their increased meaningful use.
SOCIOMETRY
We constantly assess the people we work with, go to school with, live at home with. We judge them for their suitability to work with us, play with us, live with us. And we base our judgments on our observations of their behavior in different situations. We judge, we say, on the basis of our "experience." The form of measurement we now consider, sociometry, is based on these many informal observations. Again, the method is based on remembered observations and the inevitable judgments we make of people after observing them. Sociometric questions take many forms; here are some examples:
With whom would you like to work (play, sit next to, and so on)?
Which two members of this group (age group, class, club, for instance) do you like the most
(the least)?
Who are the three best (worst) pupils in your class?
Whom would you choose to represent you on a committee to improve faculty welfare?
What four individuals have the greatest prestige in your organization (class, company, team)?
What two groups of people are the most acceptable (least acceptable) to you as neighbors
(friends, business associates, professional associates)?
^""For further discussion, see G. Lindzey and D. Byrne. "Measurement of Social Choice and Interpersonal
Attractiveness." In G. Lindzey and E. Aronson. eds.. The Handbook of Social Psychology. 2d ed. Reading,
Mass.: Addison-Wesley, 1968. vol. II. chap. 14.
all, and the other numbers representing intermediate degrees of liking to work with him." Clearly, other methods of measurement can be used. The main difference is that sociometry always has such ideas as social choice, interaction, communication, and influence behind it.
Sociometric Matrices²⁷
It is convenient to write sociometric matrices. These are matrices of numbers expressing all the choices of group members in any group. The entry in the ith row and jth column of the matrix, or, more simply, any entry in the matrix, records member i's response to member j.
Suppose a group of five members has responded to the sociometric question, "With whom would you like to work on such-and-such a project during the next two months? Choose two individuals." The responses to the sociometric question are, of course, choices. If a group member chooses another group member, the choice is represented by 1. If a group member does not choose another, the lack of choice is represented by 0. (If rejection had been called for, -1 could have been used.) The sociometric matrix of choices, C, of this hypothetical group situation is given in Table 31.1.
It is possible to analyze the matrix in a number of ways. But first let us be sure we know how to read the matrix. It is probably easier to read from left to right, from i to j. Member i chooses (or does not choose) member j. For example, a chooses b and e.
Data Analysis of Single Sociometric Relations." In ibid., chap. 4. The latter chapter is important because it shows, among other things, the application of log-linear analysis (see chap. 10, Addendum, above) to sociometric data.
²⁸See Lindzey and Byrne, op. cit., pp. 470-473, for a good review of matrix analysis. Explanation of elementary matrix operations and sociometric matrices can be found in: J. Kemeny, J. Snell, and G. Thompson, Introduction to Finite Mathematics, 2d ed. Englewood Cliffs, N.J.: Prentice-Hall, 1966, pp. 217-250, 384-406. An old but valuable review of mathematical and statistical methods for analyzing group structure and communication is: M. Glanzer and R. Glaser, "Techniques for the Study of Group Structure and Behavior: I. Analysis of Structure," Psychological Bulletin, 56 (1959), 317-332.
Sociometric choices and their interpretations can be depicted by a matrix such as that of Table 31.1 and by a directed graph. A directed graph is given in Figure 31.1.

Figure 31.1
We see at a glance that e is the center of choice. We might call him a leader. Or we might call him either a likable or competent person. More important, notice that a, b, and e choose each other. This is a clique. We define a clique as three or more individuals who mutually choose each other.³⁰ Looking for more double-headed arrows, we find none. Now we might look for individuals with no arrowheads pointing at them; c is one such individual. We can say that c is not chosen, or neglected.
Note that directed graphs and matrices say the same thing. We look at the number of choices a receives by adding the 1's in the a column of the matrix. We get the same information by adding the number of arrowheads pointing at a in the graph. For small and medium-size groups and for descriptive purposes, graphs are excellent means of summarizing group relations. For larger groups (larger than 20 members?) and more analytic purposes, they are not as suitable. They become difficult to construct and to interpret. Moreover, different individuals can draw different graphs with the same data. Matrices are general, and, if handled properly, not too difficult to interpret. Different individuals must, with the same data, write exactly the same matrices.
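Matrix handling of this kind is easily mechanized. The sketch below, in Python, uses a hypothetical choice matrix constructed to be consistent with the description in the text (Table 31.1 itself is not reproduced here), and reads off choices received and mutual choices:

    import numpy as np

    members = list("abcde")
    # Hypothetical choice matrix C: rows = choosers (i), columns =
    # chosen (j); each member chooses two others. An assumption
    # consistent with the text, not the actual Table 31.1.
    C = np.array([[0, 1, 0, 0, 1],   # a chooses b and e
                  [1, 0, 0, 0, 1],   # b chooses a and e
                  [0, 0, 0, 1, 1],   # c chooses d and e
                  [0, 1, 0, 0, 1],   # d chooses b and e
                  [1, 1, 0, 0, 0]])  # e chooses a and b

    # Choices received: the column sums (e receives 4, c receives 0).
    print(dict(zip(members, C.sum(axis=0).tolist())))

    # Mutual choices: 1 where i chooses j and j chooses i.
    mutual = C & C.T
    pairs = [(members[i], members[j])
             for i in range(len(members))
             for j in range(i + 1, len(members)) if mutual[i, j]]
    print(pairs)  # [('a', 'b'), ('a', 'e'), ('b', 'e')] -- the a-b-e clique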
Sociometric Indices³¹
In sociometry many indices are possible. Three are given below. The student will find
others in the literature.
A simple but useful index is:

    CS_j = \frac{\Sigma c_j}{n - 1}        (31.1)

where CS_j = the choice status of Person j; \Sigma c_j = the sum of choices in Column j; and n = the number of individuals in the group (n - 1 is used because one cannot count the individual himself). For C of Table 31.1, CS_e = 4/4 = 1.00 and CS_a = 2/4 = .50. How
^"L. Festinger, S. Schachter. and K. Back. Social Pressures in Informal Groups. New York: Harper & Row.
1950, p, 144. This book is not only a report of highly interesting research; it also contains a valuable method for
identifying cliques in groups. See, also. Glanzer and Glaser, op. cii.. pp. 326-327. which succinctly outlines
methods of the multiplication of binary matrices (1,0), whose application yields useful insights into group
structure.
^'The discussion of this section is for the most part based on: C. Proctor and C. Loomis, "Analysis of
Sociometric Data." In M. Deutsch, and S. Cook, Research Methods in Social Relations. New York: Holt,
Rinehan and Winston, 1951, pt. 2, chap. 17.
well or how poorly chosen an individual is, is revealed by CS. It is, in short, his choice status. It is of course possible to have a choice-rejection index. Simply put the number of 0's in any column in the numerator of Equation 31.1.

Group sociometric measures are perhaps more interesting. A measure of the cohesiveness of a group is:

    Co = \frac{\Sigma(i \leftrightarrow j)}{n(n - 1)/2}        (31.2)

where \Sigma(i \leftrightarrow j) = the number of mutual choices. With n = 5, n(n - 1)/2 = 10 possible pairs. If, in an unlimited choice situation, there were 2 mutual choices, then Co = 2/10 = .20, a rather low degree of cohesiveness. In the case of limited choice, the formula is:

    Co = \frac{\Sigma(i \leftrightarrow j)}{dn/2}        (31.3)

where d = the number of choices each individual is permitted. For C of Table 31.1, Co = 3/(2 × 5/2) = 3/5 = .60, a substantial degree of cohesiveness.
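Continuing the Python sketch given earlier, Equations 31.1 through 31.3 reduce to a few lines; the printed values match the worked numbers in the text under the same hypothetical matrix:

    n, d = 5, 2                        # group size; choices permitted
    CS = C.sum(axis=0) / (n - 1)       # Eq. 31.1: choice status
    print(dict(zip(members, CS.round(2).tolist())))  # e -> 1.0, a -> 0.5

    n_mutual = int(mutual.sum()) // 2  # each mutual pair is counted twice
    print(n_mutual / (n * (n - 1) / 2))  # Eq. 31.2 (unlimited choice)
    print(n_mutual / (d * n / 2))        # Eq. 31.3: 3/5 = .60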
Prejudice in Schools
In studying the manifestation of prejudice against blacks and Jews in schools, Smith used the simple procedure of asking all the students of entire grades of high schools to name their five best friends.³⁴ (Smith calls it "a straightforward approach that has been digni-
³²For a discussion of the basic measurement aspects of sociometric measures, especially their reliability and validity, see Lindzey and Byrne, op. cit., pp. 475-483.
³³T. Newcomb, Personality and Social Change. New York: Holt, Rinehart and Winston, 1943, pp. 54-55.
³⁴M. Smith, "The Schools and Prejudice: Findings." In C. Glock and E. Siegelman, eds., Prejudice U.S.A. New York: Praeger, 1969, chap. 5.
fied by the label of 'sociometric method.'") He then collated the responses with the responses of the students named to determine ethnic and religious group membership.³⁵ The students tended to choose their friends from their own racial and religious groups — hardly surprising. More important, Jews and blacks were not chosen as friends by members of other ethnic and religious groups. White students hardly chose black students at all. While Smith specifically says that he does not want to ascribe his findings to prejudice, it seems clear that "the virtually unpenetrated barrier" between black and white students reflects prejudice. It is evident that a sociometric approach in the study of prejudice can yield important data.
In the Morrison et al. study of the determinants of social status among mildly handicapped students cited earlier (footnote 23), social status (the dependent variable) was measured by asking the children in classes to select one of four responses for each of their classmates: a smiling face (acceptance), a straight-mouthed face (no preference), a frowning face (rejection), and a question mark (nonacquaintance). A weighted average score for each child was calculated as follows: 3 = acceptance, 2 = tolerance (no preference), and 1 = rejection. These averages were the social status scores, which were correlated with other variables, e.g., disruptive behavior, teacher rating of behavior, achievement. The two strongest influences on social status were teacher rating of cognition (positive) and student rating of behavior (negative).
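The scoring scheme just described can be stated in a few lines of Python. This is a minimal sketch: the response labels are mine, and the exclusion of nonacquaintance responses is an assumption, since the text does not say how they were handled.

    # 3 = acceptance, 2 = no preference (tolerance), 1 = rejection.
    weights = {"smile": 3, "straight": 2, "frown": 1}

    # One child's ratings from classmates; "?" marks nonacquaintance.
    responses = ["smile", "smile", "straight", "frown", "?", "smile"]
    scored = [weights[r] for r in responses if r in weights]
    print(round(sum(scored) / len(scored), 2))  # weighted average: 2.4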
Sociometric measures can usually be used. They have considerable flexibility. If defined broadly, they can be adapted to a wide variety of research in the laboratory and in the field. Their quantification and analysis possibilities, though not generally realized in the literature, are rewarding. The ability to use the simple assignment of 1's and 0's is particularly fortunate, because powerful mathematical methods can be applied to the data with uniquely interpretable and meaningful results. Matrix methods are the outstanding example. With them, one can discover cliques in groups, communication and influence channels, patterns of cohesiveness, connectedness, hierarchization, and so on.
Study Suggestions
1. The student should study one or two behavior observation systems in detail. For students of education, the Medley and Mitzel system will yield high returns. Other students will want to study one or two other systems. The best source for educational systems is Medley and Mitzel's chapter (cited earlier). It is authoritative and clear, with many examples. The two best general references are the Heyns and Lippitt chapter (footnote 2) and the Weick chapter (footnote 1) in the first and second editions of the Handbook of Social Psychology. An anthology of 79 observation systems has been published in cooperation with Research for Better Schools, Inc., a regional education laboratory (see footnote 7). The researcher who intends using observations should consult this huge collection of systems. The student of education will find excellent summaries and discussions of educational observation systems in: M. Dunkin and B. Biddle, The Study of Teaching. New York: Holt, Rinehart and Winston, 1974. The following articles are valuable: R. Boice, "Observational Skills," Psychological Bulletin, 93 (1983), 3-29; J. Herbert and C. Attridge, "A Guide for Developers and Users of Observation Systems and Manuals," American Educational Research Journal, 12 (1975), 1-20. Boice points out the lack of training for making observations of behavior and makes suggestions for such training. Herbert and Attridge provide criteria for observation systems. They also point out that knowledge of such systems is limited.
2. Sociometry has been neglected by behavioral researchers and methodologists. As its importance and analytic usefulness become better known and appreciated, and as computer programs are written to handle the large amounts of data generated, mathematical and statistical methods of sociometric and related data analysis will probably exert a stronger influence on behavioral research. The student is encouraged to explore mathematical treatments of sociometric data. The Kemeny, Snell, and Thompson reference (footnote 28) is a good introduction, though the student needs knowledge of elementary matrix algebra (which, fortunately, is not difficult). An introduction to the subject is: E. Pedhazur, Multiple Regression in Behavioral Research: Explanation and Prediction, 2d ed. New York: Holt, Rinehart and Winston, 1982, Appendix A. A highly valuable guide to mathematical and statistical analysis is the following article: M. Glanzer and R. Glaser, "Techniques for the Study of Group Structure and Behavior: I. Analysis of Structure," Psychological Bulletin, 56 (1959), 317-332.
3. An investigator, studying the influence patterns of boards of education, obtained the following matrix from one board of education. (Note that this is like an unlimited choice situation because each individual can influence all or none of the members of the group.) Read the matrix: i influences j.
4. For the situation in Study Suggestion 3, calculate the cohesiveness of the group using
Eq. 31.2.
[Answer: Co = .40.]
Chapter 32
Q Methodology
Q methodology centers particularly in sorting decks of cards called Q sorts and in the correlations among the responses of different individuals to the Q sorts.¹
¹W. Stephenson, The Study of Behavior. Chicago: University of Chicago Press, 1953.
In sorting the cards, the individual is, in effect, asked to rank order them. For statistical convenience, the sorter is instructed to put varying numbers of cards in several piles, the whole making up a normal or quasi-normal distribution. Here is a Q-sort distribution of 90 items:

    Most Approve                                      Least Approve
      3    4    7   10   13   16   13   10    7    4    3
     ----------------------------------------------------
     10    9    8    7    6    5    4    3    2    1    0

This is a rank-order continuum from "Most Approve" to "Least Approve" with varying degrees of approval and disapproval between the extremes. The numbers 3, 4, 7, . . . , 7, 4, 3 are the numbers of cards to be placed in each pile. The numbers below the line are the values assigned to the cards in each pile. That is, the 3 cards at the left, "Most Approve," are each assigned 10, the 4 cards in the next pile are assigned 9, and so on through the distribution to the 3 cards at the extreme right, which are assigned 0. The center pile is a neutral pile. The subject is told to put cards that are left over after he has made other choices, cards that seem ambiguous to him or about which he cannot make a decision, into the neutral pile. In brief, this Q distribution has 11 piles with varying numbers of cards in each pile, the cards in the piles being assigned values from 0 through 10. Statistical analyses are based on these latter values.
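The distribution just shown is easy to represent and check computationally. A minimal Python sketch (the variable names are mine):

    # Pile sizes and pile values of the 90-item Q distribution above.
    pile_sizes = [3, 4, 7, 10, 13, 16, 13, 10, 7, 4, 3]
    pile_values = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
    assert sum(pile_sizes) == 90

    # One value per card, the form used in statistical analysis.
    card_values = [v for size, v in zip(pile_sizes, pile_values)
                   for _ in range(size)]
    print(len(card_values), sum(card_values) / len(card_values))  # 90 5.0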
Sorting instructions and the objects sorted vary with the purposes of the research.
Subjects can be asked to sort attitudinal statements on an approval-disapproval contin-
uum. They can be asked to sort personality items on a "like me"-"not like me" contin-
uum. Judges can sort behavioral statements to describe an individual or a group. Aesthetic
objects, like pictures or abstract drawings, can be sorted according to degree of prefer-
ence.
The number of cards in a Q distribution is determined by convenience and statistical demands. For statistical stability and reliability, the number should probably be not less than 60 (40 or 50 in some rare cases) nor more than 140, in most cases no more than 100. A good range is from 60 to 90 cards.²
Table 32.3  A Miniature Q Sort

    Most Approve             Least Approve
      1    2    4    2    1
     -----------------------
      4    3    2    1    0

Again, the figures above the line are the numbers of cards in the piles; those below the line are the values assigned to each of the piles.
Suppose four subjects sort the cards as instructed and that the values below the line have been assigned to the cards in the piles. The four sets of values are given in Table 32.4. The numbers in the four columns under a, b, c, and d are the values assigned to the cards in the five piles after the four persons, a, b, c, and d, have sorted the cards. By inspection we suspect that the Q sorts of Persons a and b are very similar. The Q sorts of Persons c and d look similar. (With these two pairs of persons, note that the high values and the low values tend to go together.) We can also see that there seems to be little relation between the Q sorts of a and c, b and c, and b and d. To be more precise, we again need to calculate coefficients of correlation.

The r's between the four sets of values of Table 32.4 are given in the correlation matrix of Table 32.5.

The interpretation of these correlations presents no difficulties. Obviously Persons a and b sort the cards very similarly: r = .92. Persons c and d, too, are similar: r = .75. All the rest of the r's are near zero. Evidently there are two kinds or "types" of persons, insofar as attitudes toward education are concerned: "A-kind" and "B-kind."
Table 32.5  Correlation Matrix from the Q-Sort Values of Table 32.4
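The computation here is the ordinary product-moment correlation between columns of values. Because Table 32.4 is not reproduced in this text, the Python sketch below uses invented stand-in values patterned after the description (a and b alike, c and d alike, the cross-pairs nearly unrelated); each list is a legitimate sort of the ten cards of the miniature distribution.

    import numpy as np

    # Invented stand-ins for Table 32.4: values four persons
    # assigned to the same ten cards (piles 1-2-4-2-1, values 4..0).
    a = [4, 3, 3, 2, 2, 2, 2, 1, 1, 0]
    b = [4, 3, 2, 3, 2, 2, 1, 2, 1, 0]
    c = [2, 0, 3, 2, 4, 1, 2, 3, 1, 2]
    d = [2, 1, 3, 2, 4, 0, 2, 3, 1, 2]

    # Person-by-person correlation matrix, as in Table 32.5.
    print(np.corrcoef([a, b, c, d]).round(2))
    # r(a,b) and r(c,d) are high; the remaining r's are near zero.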
The items of an unstructured Q sort are like those of a personality or attitude scale: they are selected and used because they presumably measure one broad variable, like neuroticism, attitudes toward blacks, or adjustment. There is a theoretical infinite population of items, and the hope is that the set of items used by the investigator in his Q sort is a representative sample of this item population. One important population of items, used by Rogers and others, is focused on the perception of self and others. A large number of statements about the self is assembled or constructed: "I like people," "I am a failure," "I just can't seem to make up my mind on things," and so on.
Individuals are asked to sort the cards to describe themselves as they think they are, as other people see them, and the like. The cards are sorted into a Q distribution, the sorts intercorrelated, and the principal analysis focuses on the correlations among persons and on factor or cluster analysis. Inferences about the efficacy of therapy or training programs are then drawn from the results of the analysis. For example, trainees — in teacher training, the Peace Corps, doctoral programs — sort the cards before, during, and after the training. Assume that there is an "ideal" or criterion Q sort available, provided by the trainers. One reasons that if the training has been effective — and one of the important results of training is measured by the Q sort — then the correlations between the trainees' sorts and the criterion sort are higher after the training than before it.
Correlation approaches have more or less dominated Q studies.⁶ One of Stephenson's important contributions, the testing of "theory" and the principle of building "theory" into Q sorts by means of structured samples of items, has been neglected.⁷
Structured Q Sorts
An unstructured Q sort consists of a set of items all of which are in one domain — for example, social-values items, self items, teacher-characteristics items — but which are not otherwise differentiated in the Q sort or in the analysis. The items of a structured Q sort, on the other hand, are also in one domain, but they are partitioned in one or more ways. For instance, a child psychologist may be studying moral growth in children. One aspect of his theory implies that as children get older, control of their behavior becomes more internal. A Q sort can be structured as internal-external, with half the items reflecting internal control and half external control. This is the simplest possible partition of a structured Q sort.
To structure a Q sort is virtually to build a "theory" into it. Instead of constructing instruments to measure the characteristics of individuals, we construct them to embody or epitomize "theories." In the use of Q as Stephenson sees it, individuals as such are not tested; theoretical propositions are tested. Naturally, individuals must do the Q sorting. And Q sorts can, of course, be used to measure characteristics of individuals. But the basic rationale of Q, as Stephenson sees it, is that we have individuals sort the cards not so much to test the individuals as to test "theories" that have been built into the items.

Building a theory into a measurement instrument, while not frequent, is not new. A
⁶J. Wittenborn, "Contributions and Current Status of Q Methodology," Psychological Bulletin, 58 (1961), 132-142.
⁷Ibid., pp. 138-139. See Stephenson, op. cit., pp. 66-85.
well-known example is the Study of Values, based on Spranger's theory of six types of men: Theoretical, Economic, Aesthetic, Social, Political, and Religious.⁸ The purpose of the instrument is not to test the theory but to measure the values of individuals. If a person is, say, basically a religious "type," he should select items of the Religious category over items of other categories.

The Stephenson approach to the same problem would be to test the Spranger theory. (Note that the Study of Values can be used to test the Spranger theory.) A Q sort would be constructed using the Spranger system as a guide. Items would be selected from various sources and specially written to represent the six Spranger values. There would be 10 to 15 Theoretical items, 10 to 15 Aesthetic items, and so forth, making a total of 60 to 90 items in the entire Q sort. Individuals would then be selected to "represent" the six values. For example, the investigator might select ministers and priests (Religious), businessmen (Economic), artists and musicians (Aesthetic), scientists and scholars (Theoretical), and so on.⁹
If the theory is "valid," and if the Q sort adequately expresses the theory — two rather big "ifs" — the statistical analyses of the sorts should show the theory's validity. That is, if any individual with "known" values — a minister or priest can be expected to have strong religious values, an artist strong aesthetic values — takes the sort with instructions to place favored or approved statements high and disapproved statements low, we would expect him to place the 10 or 15 statements congruent with his role and its associated values high. Statements associated with other roles and values we would expect him to place lower. If a scientist sorts the cards, we would expect him to place the cards in the Theoretical category high and cards of other categories relatively lower. Naturally, there will be few individuals whose sorts will be so clear-cut. Human beings and their attitudes and values are too complex. But we can expect some such results to occur beyond chance expectation if the theory is valid.
A Q sort suggested by the above considerations can be called a "one-way structured sort," because there is one basis of variable classification.¹⁰ This is directly analogous to simple one-way analysis of variance. To make the matter clear, an example from a research study designed in part to test the Spranger theory in Q fashion can be given. A 90-item Q sort was used. Each item was a single word, each word having been previously categorized by judges into the six Spranger values. There were 15 Theoretical words: science, knowledge, reason, and so on; 15 Religious words: God, church, sermon, and so on; to a total of six categories and 90 words.
The cards were sorted by a number of persons chosen for their presumed possession of the six values. They were asked to sort the cards according to the degree to which they favored or did not favor the words on the cards. To illustrate the results, here are the mean values in rank order of the sort of one subject, a musician:

[mean values of the six categories, in rank order: Aesthetic (highest), Social, Theoretical, Political, Economic, Religious (lowest); the figures are not legible in this reproduction]

F = 26.82, significant at the .001 level. This means that the musician significantly differentiated the six values. What is the pattern of differentiation? The wider spaces indicate
⁸G. Allport, P. Vernon, and G. Lindzey, Study of Values, rev. ed. Boston: Houghton Mifflin, 1951.
⁹In an interesting Q study of religious attitudes, Broen selected 24 clergymen to represent the full spectrum of religious beliefs and attitudes. There were four representatives of each of five major religious groupings. Broen says that the subjects were selected from churches and institutions known to have religious orientations in the directions of his hypothesized religious categories. W. Broen, "A Factor Analytic Study of Religious Attitudes," Journal of Abnormal and Social Psychology, 54 (1957), 176-179.
¹⁰Stephenson has not stressed the possibility of Q sorts of the one-way type. His Q designs are almost all of the factorial two- and three-way type. There seems to be no reason why one-way designs cannot be used.
significant gaps.¹¹ Although Aesthetic is the highest mean, there is no significant gap between Aesthetic and the next highest mean, Social (6.27). In fact, Aesthetic, Social, and Theoretical form a subset which is separated by a significant gap from all the other means. Political and Economic form another subset. Religious, the lowest mean (1.40), is significantly separated from all the other means. Evidently the musician highly favors Aesthetic, Social, and Theoretical words, and strongly disfavors Religious words. From this analysis we may perhaps draw inferences as to her value system, at least insofar as the measurement system allows. Independent knowledge of the subject confirmed this analysis.
If a set of stimuli can be categorized in this fashion, a one-way structured Q sort may
be possible to construct and desirable to use. Small theories and hypotheses can be tested
in this manner by having subjects of known values, attitudes, personality, roles, and so
forth sort the cards. The student should realize that in addition to the analysis of variance
structured sort approach, correlation analysis is always applicable. Simply correlate the Q
sorts of different persons, and disregard the structure built into the sort.
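A one-way structured sort of the Spranger kind can be tested with ordinary one-way analysis of variance. Below is a minimal Python sketch with invented data; the category centers are loosely patterned after the musician example, and scipy's f_oneway performs the test.

    import numpy as np
    from scipy.stats import f_oneway

    # One sorter's values for 90 words: six Spranger categories,
    # 15 words each. Invented data for illustration only.
    rng = np.random.default_rng(0)
    category_centers = [7, 6, 6, 4, 4, 1]   # Aesthetic ... Religious
    groups = [np.clip(rng.normal(m, 1.5, 15), 0, 10)
              for m in category_centers]

    # Does the sorter significantly differentiate the six categories?
    F, p = f_oneway(*groups)
    print(round(F, 2), p < .001)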
Many theories and hypotheses that can be structured along the lines of analysis of variance paradigms have the potentiality of being tested with Q methods. The Spranger example just discussed is a case in point. Other one-way examples might be: introversion-extroversion; oral eroticism-anal eroticism; progressivism-traditionalism; liberalism-conservatism; open mindedness-closed mindedness; and so on. But how about more complex theories and hypotheses? Taking the next logical step, we add another dimension or variable to the Q paradigm. This makes a two-variable Q sort and a two-variable or factorial analysis of variance design. The Q sort is structured in two ways rather than one. This means that every item in the Q sort reflects facets of the two variables. A little reflection will show that this can often be quite difficult.
To illustrate two-way structured sorts, the paradigm of an 80-item sort to explore social attitudes and to test a structural "theory" of attitudes is given in Table 32.6.¹² The means of a known conservative individual are also given in the table.

First note the Q-sort structure. The two main variables are Attitudes and Abstractness. Attitudes is partitioned into Liberal (L) and Conservative (C), Abstractness into Abstract (A) and Specific (S). Any item of the Q sort must fit into one of the four cells of the cross partition. Any attitude item must be either Conservative or Liberal and at the same time either Abstract or Specific. The more important variable is Attitudes. The second variable, Abstractness, was incorporated into the structure because it was conjectured that both liberals and conservatives would react differently to the abstractness-specificity of items.
"A simple way to do this test is to calculate the standard error of the difference between means. The for-
mula is
where V„ = within-groups variance (from the analysis of variance). Multiply this value by 2;
—+
^nR¥ -jt) = 2V2.34 X 1.33 = 2 X .558 =1.12
Any difference equal to or greater than 1 12 is significant. Perhaps a more legitimate but more conservative test
.
isScheffe's. See Chapter 13, Study Suggestion 6. Also see A. Edwards, Statistical Methods. 2d ed. New York:
Holt, Rinehart and Winston. 1967, pp. 265-269.
¹²F. Kerlinger, "A Q Validation of the Structure of Social Attitudes," Educational and Psychological Measurement, 32 (1972), 987-995. The theory tested is discussed in: F. Kerlinger, "Social Attitudes and Their Criterial Referents: A Structural Theory," Psychological Review, 74 (1967), 110-122.
Table 32.6

                               Attitudes
                      Liberal (L)   Conservative (C)   Means
    Abstractness
      Abstract (A)       3.15            4.70           3.93
      Specific (S)       2.45            5.70           4.07
      Means              2.80            5.20
For example, a conservative might strongly endorse specific conservative items, while another conservative might endorse abstract conservative items — and similarly for liberals. Whether the structure is valid is, of course, an empirical matter. At any rate, here are four items, one for each of the cells of Table 32.6. The labels correspond to those given in the table and on page 514.

    social equality (LA)
    Supreme Court (LS)
    competition (CA)
    private property (CS)
The data of this individual's sort can be analyzed with analysis of variance, provided we use care and circumspection in the interpretation of the data. (The questionable nature of using analysis of variance with Q-sort data will be discussed later.) The analysis of variance to use, of course, is the factorial type. The individual whose data (means) are in the table is a "known" conservative who evidently favors conservative referents of the specific kind. The L and C means are 2.80 and 5.20, a difference of 2.40, which is significant at the .01 level. The difference between the A and S means (3.93 and 4.07) is not significant.
Much more interesting, note the Specific row means: 2.45 and 5.70. Contrast this with the Abstract row means: 3.15 and 4.70. The two differences lead to an interaction F ratio of 5.23, significant at the .05 level. Although we should treat .05-level differences with caution, these results indicate the subject's preference for conservative and specific referents. Of the four referents given above, he would probably rank private property above the others. Since we knew that this individual was conservative before we started, this is some small confirmation of the validity of the reasoning that went into the Q sort.
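The factorial analysis for one sorter can be sketched in the same spirit. The Python fragment below uses invented item values centered on the cell means of Table 32.6 and the statsmodels ANOVA routines; it is an illustration under those assumptions, not the analysis actually used in the study.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    # 80 items: 20 per cell of the L/C x A/S cross partition, with
    # invented values centered on the means reported in Table 32.6.
    rng = np.random.default_rng(1)
    rows = []
    for attitude, abstractness, center in [("L", "A", 3.15), ("L", "S", 2.45),
                                           ("C", "A", 4.70), ("C", "S", 5.70)]:
        for value in rng.normal(center, 1.5, 20):
            rows.append({"attitude": attitude,
                         "abstractness": abstractness,
                         "value": value})
    df = pd.DataFrame(rows)

    # 2 x 2 factorial ANOVA: main effects and the interaction.
    model = smf.ols("value ~ C(attitude) * C(abstractness)", data=df).fit()
    print(anova_lm(model).round(3))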
The use of the abstractness dimension in this sort was not dictated by theory; it was purely exploratory. A number of individuals (14 out of 33), however, had significant interaction F ratios, which indicates that it makes a difference to many individuals whether referents are abstract or specific. This may even mean that we would have to talk about abstract and specific liberalism and conservatism. It may also mean that conservatives tend to favor specific referents, while liberals favor abstract referents. Despite these findings, significant interactions are ordinarily not expected; the principal interest in Q is in the main effects. There is no reason, however, why interactions cannot be predicted on the basis of theory or hunch and deliberately studied, just as in experimental work in learning and teaching.
Before leaving structured sorts, it should be mentioned that Q designs are not limited to the simple 2 × 2 case shown above. Other combinations — 3 × 2, 4 × 3, 2 × 4 — are possible. Three- and four-variable designs are also possible, if not too practicable. Another possibility, mentioned in an earlier chapter, is the application of the structure idea to
¹³See Stephenson, op. cit., pp. 103 and 163-164. There are a number of interesting structured Q sorts in this volume and in Stephenson's later book: W. Stephenson, The Play Theory of Mass Communication. Chicago: University of Chicago Press, 1967.
objective tests and scales. The items of the referents Q sort can be put into summated-rating scale form, for instance, and scored and analyzed accordingly.¹⁴
In a Q study of the relation between attitudes toward education and teachers' perceptions of desirable teaching behaviors, Sontag used an 80-item Q sort whose items were brief descriptions of a large variety of teaching behaviors.¹⁶ Half of 80 elementary and secondary school teachers were instructed to sort the behaviors according to their desirability for elementary teachers and the other half according to their desirability for secondary teachers. The Q-sort results for each group were factor analyzed separately and factor arrays calculated.¹⁵ Four factors were obtained in each analysis. These factor arrays are complete 80-item Q sorts that describe a weighted average of the relative values the teachers put on the behaviors. Four of the highest values for two of the elementary-teacher behavior factors, with the names of the factors, are:
One can readily get a feeling for the underlying themes behind the items, even with
these few items. Ordinarily, more positive items are needed to identify and interpret the
'"Some of the have been described in: F. Kerlinger. "Q Methodology in Behavioral Research."
possibilities
In S. Brown and D. Brenner, eds.. Science. Psychology, and Communication: B:ssays Honoring William
Stephenson. New York: Teachers College Press, 1972, chap. 1.
'"''For details of calculating factor arrays, see Stephenson, The Study of Behavior, pp. 176-179.
'*M. Sontag. "Attitudes Toward Education and Perception of Teacher Behaviors, American Educational
"
arrays. In addition, the negative ends of the arrays can and often should be used, since they may be helpful in interpretation. Note, too, that the Q sorts of new subjects can be correlated with the arrays, a valuable procedure that has rarely been used. The correlations can be used to identify the factor predispositions of students, teachers in training and in service, administrators, and so on. This kind of use can be particularly valuable in studies of attitude, value, belief, and perception (or judgment) change. The perceptions or judgments of desirable teacher characteristics and behaviors before and after, say, special training can be correlated with "ideal" perceptions of the trainers, as indicated earlier, or with the factor arrays.
"This a dicey prediction, indeed, because one expects the factor analysis and the analysis of variance
is
While the results should agree in general if the theory holds, it is unrealistic to
results to agree in all particulars.
expect them to agree in all particulars. The example, however, illustrates what is meant by the affinity of Q and
theory. Theories that can be tested with Q are structural theories, that is, they are explanation of the relations
among the elements or variables of a phenomenon under study. For a discussion of different kinds of theories in
behavioral science and research, see Kerlinger, Liberalism and Consen-atism: The Nature and Structure of
Social Attitudes, op. cit., chap. 2.
Most statistical tests assume independence. This means that the response to one item should not be affected by the responses to other items. In Q the placement of one card somewhere on the continuum should not affect the placement of other cards. If Q placements affect each other, then the independence assumption is violated. Q is an ipsative, forced-choice procedure, and it will be recalled that such procedures violate the independence assumption; the placement of one Q card affects the placement of other cards. It is, after all, a rank-order method.
The real question is: How serious is the violation of the assumption? Is it serious enough to invalidate the use of correlation and analysis of variance procedures? There is no doubt that in an 80-item sort, there are not really 79 degrees of freedom. Thus, to some extent at least, the analysis of variance procedure is vitiated. It is doubtful, however, that too much is risked in Q statistical situations, if there is a fairly large number of items. One can perhaps fall back on Fisher's advice given long ago: raise the requirements for statistical significance. Instead of accepting the .05 level in Q sorts, require the .01 level of significance. In most cases of Q statistical significance encountered by the author, F ratios are so high they leave little doubt as to statistical significance.²¹
Another criticism of Q has focused on the forced-choice feature of Q sorting. It has been said that the forced procedure is unnatural, that it requires the subject to conform to an unreasonable requirement. Furthermore, important information on elevation and scatter is said to be lost. This means, for example, that two individuals can correlate highly because their profiles are alike. Yet these two individuals might be quite unlike: one might be high on a scale and the other low on the scale. (The computation of r takes no account of mean differences, or differences in level or elevation.) The Q procedure throws away level differences between individuals.
On the constraint argument, all psychometric procedures are constraints on the individual. That an individual feels constrained in sorting Q sorts, however, is no really good reason for declaring the procedure invalid. Most such inferences are probably made by critics who think forced procedures constrain the individual. In the experience of the author and his students, very few individuals complain about the procedure. Most of them, indeed, seem to enjoy it. Livson and Nichols say that the Q sorter is his own worst critic and that researchers should not be unduly alarmed by adverse sorter criticisms of the method.²² They recommend use of the forced procedure after careful study of alternatives.
The evidence on the relative merits of forced and unforced Q sorts is mixed. Many years ago Block found forced sorting equal or superior to unforced procedures.²³ Jones, on the other hand, found the forced procedure wanting.²⁴ Brown, much more recently, concluded from his studies and experience that results are little affected by different distributions and forced and unforced procedures.²⁵ I believe that for its purpose the forced sorting procedure is useful. Whether the distribution of items is normal, rectangular, or otherwise is not so important, though quasi-normal distributions seem to work well. The important thing is to force individuals to make discriminations that they often will not make unless required to do so.
The criticism on the loss of information in Q sorting through lack of elevation and scatter is more serious.²⁶ The argument is too complex to discuss here. The reader should realize, however, that every time a coefficient of correlation is computed, the elevation
²¹It is well, however, to bear the independence stricture in mind. Instructions to subjects should not encourage lack of independence. That is, tell subjects that they can always move any card or cards from one pile to another right to the end of the sorting procedure. Moreover, one should not use Q-sort "scores" normatively. That is, one should not use the values assigned to the Q piles (see the Q-sort distribution of 90 items given earlier and the numbers assigned to the piles) as though they were scores of individuals on a variable, add them across persons, and use them in statistical tests of significance.
"N. Livson and T. Nichols. "Discrimmation and Reliability in Q-Sort Personality Descriptions." Journal
of Abnormal and Social Psychology. 52 1956). 159-165. The author recalls an amusing and instructive inci-
(
dent. Colleagues had been asked to sort a 90-item unstructured Q sort the items of which were single words. One
colleague, a philosopher, complained about the procedure. When he had finished, he said the procedure was
highly questionable, and that if he had to do it over again the results would certainly be quite different. He did
the sort again eleven months later. The coefficient of correlation between the first and second sorts was .81!
²³J. Block, "A Comparison of Forced and Unforced Q-Sorting Procedures," Educational and Psychological Measurement, 16 (1956), 481-493.
²⁴A. Jones, "Distribution of Traits in Current Q-Sort Methodology," Journal of Abnormal and Social Psychology, 53 (1956), 90-95.
²⁵S. Brown, Political Subjectivity: Applications of Q Methodology in Political Science. New Haven: Yale University Press, 1980, pp. 201-203, 288-289.
²⁶See L. Cronbach and G. Gleser, "Assessing Similarity Between Profiles," Psychological Bulletin, 50 (1953), 456-473.
(mean) and scatter (standard deviation) of the sets of scores are lost. Q is not unique here. Q is unique, however, in systematically using a procedure that sacrifices level and scatter. All individuals have the same general mean and the same general standard deviation.
The practical answer is simple to state but not simple to implement: when elevation and scatter are important, do not use ipsative measures. If you are comparing the mean performances of two groups, for example, ipsative scores are of course inappropriate.²⁷ If, on the other hand, mean differences are not important but the relations among variables within individuals or groups are important, then ipsative scores may well be appropriate. In the last analysis, the experience and judgment of the researcher are the final arbiters of whether Q sorting should be used.
When hypotheses call for comparisons of groups or for experimental manipulations, simpler procedures are usually more appropriate.
applies whenever a hypothesis is tested by comparing the central tendencies, variabilities,
or relative frequencies of characteristics of groups of individuals. Q can profitably be used
for comparing the characteristics of groups of individuals only when comparing the rela-
tions within the groups. For example, we might test a hypothesis that two specified groups
of individuals, categorized on the basis of holding different values or attitudes, will also
cluster together similarly on some other measure presumably related to the values or
attitudes.
Some research problems lend themselves nicely to Q. Complex aesthetic judgments and preferences are examples. Stephenson, in a brilliant tour de force, applied the structured Q-sort idea to artistic judgments.²⁸ He used actual squares and rectangles juxtaposed in various ways as items in a Q sort. Three variables were built into the sort: shape dominance (regular-irregular), shape concentration (overlapping-not overlapping), and color. He used an artist, himself, and graduate students as subjects, and made statistical predictions based on an aesthetic theory.
Getzels and Csikszentmihalyi followed a related and equally interesting procedure.²⁹ They had art students, all of whom were highly competent but of differing degrees of artistic talent, produce 31 drawings under controlled conditions. Then they had experts (artists) and nonexperts (graduate students) judge the drawings on craftsmanship, originality, and overall aesthetic value by Q sorting the drawings. That is, each judge (total of 20 judges) sorted or rated the drawings three times. Correlations between Q sorts enabled the authors to conclude, among other things, that artists differed as much among themselves in judgments as laymen, but that they evidently related originality more to overall value than did the laymen. Both studies are themselves creative and original uses of Q to study the complex and highly elusive problem of aesthetic judgment. It is difficult to conceive a better empirical approach to the problem.
As indicated earlier, Q can be used to open up new areas, to test preliminary theories, to explore heuristic hunches. The problem of creativity has been tackled up to now almost entirely with large-N cross-sectional methods. Following the lead of the two aesthetic
"This stricture has often been disregarded with unknown consequences. See footnote 21. above.
-^Stephenson. The Study of Behavior, pp. 128ff.
"J. Getzels and M. Czikszentmihalyi. "Aesthetic Opinion: An Empirical Study," Pubic Opinion Quar-
terly. 30 (1969), 34-45.
judgment studies just described, study of the stubborn but fascinating problem of creativity can be tackled. One might take a Guilford theory of convergent-divergent thinking or a Barron originality theory and explore them with Q.³⁰ Complex areas, especially psychological areas where intensive study of the individual is required, do not always yield too readily to large-N approaches. Q methods, adequately used, should be useful in laying some of the research foundations in these areas.
One of the most valuable outcomes of a successful Q study is the factor array idea discussed earlier. A factor array, recall, is a "new" Q sort constructed from factor analytic results. The items of an array have Q values calculated from the Q sorts of the persons loaded on a factor. These items and their array values express the essence or content of a persons factor. They epitomize the variable that the persons on the persons factor share to a substantial degree. They form, in other words, a prototype. They can be used in a number of ways, even experimental. One of the most important is to assess the "agreement" with the prototype of untested individuals. For example, arrays can be calculated for the two factors, Concern for Students and Structure and Subject Matter, that Sontag found in his perceptions of teacher behaviors study outlined earlier (see the examples of items of the arrays given earlier). If we were studying, say, success in teaching and had found that those teachers high on Structure and Subject Matter tended to be successful, then we might use the prototype Q sort with teacher trainees. Perhaps further study of those individuals whose sorts correlate substantially with the prototype may help us learn more of the psychological and other characteristics that seem to be characteristic of "good" teachers. Or we can assess the influence of a teacher-training program by having trainees sort prototype Q sorts before, during, and after the training. Toward which prototype does the program lead students? The research potential of factor arrays seems to be great. Maybe researchers in education and psychology should pay more attention to their possibilities.
Q cannot, however, be used too well with large samples.³¹ One can rarely generalize to populations from Q persons samples. Indeed, one usually does not wish to do so. Rather, one tests theories on small sets of individuals carefully chosen for their "known" or presumed possession of some significant characteristic or characteristics. One explores unknown and unfamiliar areas and variables for their identity, their interrelations, and their functioning. Used thus, Q is an important and unique approach to the study of psychological, sociological, and educational phenomena.³²
'"J. Guilford, "Three Faces of Intellect." American Psychologist. 14 (1959), 469-479: F. Barron,
"Complexity-Simplicity as a Personality Dimension," Journal of Abnormal and Social Psychology, 48 1953), (
163-172; "The Disposition toward Originality," Journal of Abnormal and Social Psychology, 51 (1955),
478-485.
" At least one adaptation of the Q idea, more or less minus its ipsative feature, has been worked out so that
some of the advantages of Q can be obtained in large-sample surveys. See. for example. E.. Cataldo el al.,
"Card Sorting as a Technique for Survey Interviewing." Public Opinion Quarterly. 34 (1970), 202-215. In
addition, it is possible to construct pencil-and-paper measures that incorporate Q ideas. See D. Jackson and C.
Bidwell, Modification of Q-Technique." Educational and Psychological Measurement. 29 (1959), 221-
"A
232; H. Webster, "A Forced-Choice Figure Preference Test Based on Factorial Design," Educational and
Psychological Measurement. 29 (1959), 45-54. Both methods are ingenious and potentially effective.
'^For a supplementary discussion in considerable depth of the place of Q in behavioral research, see Kerlin-
ger. "Q Methodology in Behavioral Research," op. cit.
Study Suggestions
There are unfortunately very few widely available references on Q methodology; Brown's book, cited earlier, is one of the few.
1. Have the following twelve referents typed and reproduced: civil rights, children's interests, Supreme Court, poverty program, Jews, socialized medicine; private property, education as intellectual training, subject matter, free enterprise, school discipline, religion. The first six referents have been found to be liberal referents and the last six to be conservative referents (see text). Select ten members of the class at random to rank order the referents according to positive and negative feelings about them. (Rank 1, for instance, can be "strongly positive" and Rank 12 "strongly negative.") Then intercorrelate the rank orders using the rank-order coefficient of correlation (see footnote 2) to produce a 10 × 10 correlation matrix. See if any of the class members cluster together judged by substantial correlations (≥ .40). Are there two clusters? If so, go back to the original rank orders of the members of the cluster to identify what is common to the cluster members. (A simple way to do this is to add the values given each item by the members of a cluster. Then rank order these sums. The nature of the cluster may be deduced from its rank order on this set of sums of ranks.)
(Note: If the attitudes of the ten individuals are homogeneous, you may obtain only one cluster. It may be necessary to go outside the class for greater heterogeneity of attitudes.)
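The computations of Suggestion 1 are easily scripted. The following is a minimal sketch in Python; the random rankings and the cluster membership shown are assumptions standing in for real class data, and scipy's spearmanr supplies the rank-order coefficient.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_judges, n_referents = 10, 12

# Each row: one judge's ranks (1 = strongly positive ... 12 = strongly negative).
# Random permutations stand in for real class data.
ranks = np.array([rng.permutation(n_referents) + 1 for _ in range(n_judges)])

# Columns of ranks.T are judges, so spearmanr correlates judges pairwise.
rho, _ = spearmanr(ranks.T)
print(np.round(rho, 2))  # 10 x 10 matrix; look for blocks of r >= .40

# To characterize a cluster, add the ranks its members gave each referent
# and rank-order the sums (low sums = referents the cluster favors).
cluster = [0, 3, 7]  # hypothetical cluster membership, for illustration
sums = ranks[cluster].sum(axis=0)
print(sums.argsort() + 1)  # referents ordered from most to least favored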
" It is useful, in recording an individual's g-sort data, to write the values of the pile placements
on the backs
of the cards with the individual's initials, and numbers of an individual in the
being careful to record the initials
same relative position on each card. With structured sorts, record the structure category symbols on the back of
each card. Number the faces of the cards with random numbers through n. n being the number of cards in the
1
deck. There are more elaborate systems for sorting and recording data —
for example, racks for sorting and
scoring sheets for entering pile placement values —
but these are not recommended.
PART TEN

MULTIVARIATE APPROACHES AND ANALYSIS
Introduction

The modern behavioral researcher must understand and be able to do multivariate analysis. As we will see later when we study the computer, it is not enough to have a smattering of the methods and to let the computer do the rest. We must penetrate the surface and know, both intuitively and analytically, the rationale and working of multivariate research problems and their analysis.
Multivariate analysis is a general term used to describe a group of mathematical and statistical methods whose purpose is to analyze multiple measures of N individuals. Its influence is profound: the very nature of the problems that behavioral scientists study changes radically.
Because they are the basic ingredients, so to speak, of most multivariate methods,
we examine multiple regression and factor analysis in more depth and detail than we
do other multivariate methods. Chapter 33 examines the foundations of multiple re-
gression and the interpretation of its results, and Chapter 34 explores its application
to analysis of variance and covariance and its use in path analysis. In Chapter 34 we
also briefly examine discriminant analysis, canonical correlation, and multivariate
analysis of variance. Unfortunately, we have the space to do little more than charac-
terize these analytic methods. Chapter 35 on factor analysis is one of the most impor-
tant in the book. It attempts to present some idea of the sweep and even grandeur of
the queen of methodologies, factor analysis. Finally, in Chapter 36 we face our great-
est challenge: analysis of covariance structures. In this most ambitious approach, vir-
tually all other multivariate methods, but especially factor analysis and multiple re-
gression, are combined and generalized and the powerful idea of latent variables used
as a functional part of the system. We end the substantive and methodological discus-
sion of the book fittingly, in other words, with consideration of an analytic system
that makes possible the conceptualization and empirical testing of complex behavioral
theories and alternative hypotheses.
Chapter 33
Multiple Regression
Analysis: Foundations
Multiple regression analysis is a method for studying the effects and the magnitudes of the
effects of more than one independent variable on one dependent variable using principles
of correlation and regression. We turn immediately to research and defer explanation until
later.
¹L. Lave and E. Seskin, "Air Pollution and Human Health," Science, 169 (1970), 723-733.
²Recall that squaring a correlation coefficient yields an estimate of the amount of variance shared by two variables. This notion is used a great deal in regression analysis.
mortality, accounted for by the two independent variables. The R²'s between mortality due to bronchitis, on the one hand, and air pollution and socioeconomic status, on the other hand, ranged from .30 to .78 in different samples in England and Wales, indicating substantial relations. The R²'s for the dependent variables, lung cancer and pneumonia mortalities, were similar. Multiple regression also enables the researcher to learn something of the relative influences of independent variables. In most of the samples, air pollution was more important than socioeconomic status. As a "control" analysis, Lave and Seskin studied other cancers that would presumably not be affected by air pollution. The R²'s were consistently lower, as expected. Extension of the research to metropolitan areas in the United States yielded similar results.³
In a study of the prediction of high school GPA (grade-point average), Holtzman and Brown used two independent variable measures: study habits and attitudes (SHA) and scholastic aptitude (SA).⁴ The correlations between high school GPA (the dependent variable) and SHA and SA in grade 1 (N = 1684) were .55 and .61. The correlation between SHA and SA was .32. How much more variance was accounted for by adding the scholastic-aptitude measure to the study-habits measure? If we combine SHA and SA optimally to predict GPA, we obtain a correlation of .72. The answer to the question, then, is .72² − .55² = .52 − .30 = .22, or 22 percent more of the variance of GPA is accounted for by adding SA to SHA.
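For readers who want to verify this arithmetic, here is a minimal sketch in Python. It assumes the standard formula for the multiple correlation with two predictors, which is not derived until later in the chapter; the correlations are those reported above.

import math

# Two-predictor multiple correlation:
#   R^2 = (r_y1^2 + r_y2^2 - 2*r_y1*r_y2*r_12) / (1 - r_12^2)
r_y1, r_y2, r_12 = .55, .61, .32   # GPA-SHA, GPA-SA, SHA-SA

R2 = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
R = math.sqrt(R2)
print(round(R, 2))              # ~ .72, the combined prediction
print(round(R2 - r_y1**2, 2))   # ~ .21; the text's rounded figures give .22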
These are examples of multiple regression analysis. The basic idea is the same as simple correlation except that k (where k is greater than 1) independent variables are used to predict Y. The method and the calculations are done in a manner to give the "best" prediction possible, given the correlations among all the variables. In other words, instead of saying: If X, then Y, we say: If X₁, X₂, . . . , Xk, then Y, and the results of the calculations tell us how "good" the prediction is and approximately how much of the variance of Y is accounted for. Prediction is rarely perfect; the higher the correlation, however, the better the prediction. If r = 1.00, then prediction is perfect. To the extent
³Students should bear in mind our earlier discussions of the difficulty in interpreting nonexperimental results (Chapter 22). Lave and Seskin, however, have built a strong case, even though some of their interpretation was questionable. At this point, it would be wise for readers to turn back to the discussion, "Multivariate Relations and Regression," in Chapter 5.
⁴W. Holtzman and W. Brown, "Evaluating the Study Habits and Attitudes of High School Students," Journal of Educational Psychology, 59 (1968), 404-409.
that the correlation departs from 1.00, to that extent predictions from X to Y are less than perfect. If we plot the X and Y values when r = 1.00, they will all lie on a straight line. The higher the correlation, the closer the plotted values will be to the regression line (see Chapter 5).
To illustrate and explain the notion of statistical regression, we use two fictitious examples with simple numbers. The numbers used in the two examples are the same except that they are arranged differently.⁵ The examples are given in Table 33.1. In the example on the left, labeled A, the correlation between the X and Y values is .90, while in the example on the right, labeled B, the correlation is 0. Certain calculations necessary for regression analysis are also given in the table: the sums and means, the deviation sums of squares of X and Y (Σx² = ΣX² − (ΣX)²/n), the deviation cross products (Σxy = ΣXY − (ΣX)(ΣY)/n), and certain regression values to be explained shortly.
First, note the difference between the A and B sets of scores. They differ only in the order of the scores of the second or X columns. The two different orders produce very different correlations between the X and Y scores. In the A set, r = .90, and in the B set, r = .00. Second, note the statistics at the bottom of the table. Σx² and Σy² are the same in both A and B, but Σxy is 9 in A and 0 in B. Let us concentrate on the A set of scores.
The basic equation of simple linear regression is:
Y' = a + bX (33.1)
Equation 33.1 is a prediction formula: Y values are predicted from X values. The correlation between the observed X and Y values in effect determines how the prediction equation "works." The intercept constant, a, and the regression coefficient, b, will be explained shortly.
The two sets of X and Y values of Table 33.1 are plotted in Figure 33.1. Lines have been drawn in each plot to "run through" the plotted points. If we had a way of placing these lines so that they would simultaneously be as close to all the points as possible, then the lines should express the regression of Y on X. The line in the left plot, where r = .90, runs close to the plotted XY points. In the right plot, however, where r = .00, it is not possible to run the line close to all the points. The points, after all, are in effect placed randomly, since r = .00.
The correlations between X and Y, r = .90 and r = .00, determine the slopes of the regression lines (when the standard deviations are equal, as they are in this case). The slope indicates the change in Y with a change of one unit of X. In the r = .90 example, with a change of 1 in X, we predict a change of .90 in Y. (This is expressed trigonometrically as the length of the line opposite the angle made by the regression line divided by the length of the line adjacent to the angle. In Figure 33.1, if we drop a perpendicular from the regression line, from the point where the X and Y means intersect, for example, to a line drawn horizontally from the point where the regression line intersects the Y axis, or at Y = −.60, then 3.6/4.0 = .90. A change of 1 in X means a change of .90 in Y.)
The plot of the X and Y values of Example B, right part of Figure 33.1, is quite
different. In Example A, one can rather easily and visually draw a line through the points
and achieve a fairly accurate approximation to the regression line. But in Example B this
is hardly possible. We can draw the line only by using other guidelines, which we get to
shortly. Another important thing to note is the scatter or dispersion of the plotted points
around the two regression lines. In Example A, they cling rather closely to the line. If
r = 1.00, they would all be on the line. When r = .00, on the other hand, they scatter
widely about the line. The lower the correlation, the greater the scatter.
Figure 33.1  A: r = .90. B: r = .00 (regression line Y' = 3 + (0)X).
⁵Raw scores have been used for most of the examples in this chapter because they fit our purposes better. A thorough treatment of regression, however, requires discussions using deviation scores and standard scores. The emphasis here, as elsewhere in the book, is on research uses of the methods and techniques and not on statistics as such. The student should supplement his study, therefore, with good basic discussions of simple and multiple regression. See the references in the study suggestions at the end of the next chapter.
The b's are calculated with the formula:

b = Σxy / Σx²     (33.2)

The two b's are .90 and .00. The intercept constant, a, is calculated with the formula:

a = Ȳ − bX̄     (33.3)

The a's for the two examples are −.60 and 3; e.g., for Example A, a = 3 − (.90)(4) = −.60. The intercept constant is the point where the regression line intercepts the Y axis. To draw the regression line, lay a ruler between the intercept constant on the Y axis and the point where the mean of Y and the mean of X meet. (In Figure 33.1, these points are indicated with small circles.)
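Formulas 33.1 through 33.3 can be made concrete with a few lines of Python. The X and Y values below are hypothetical (they are not the values of Table 33.1); only the formulas are from the text.

import numpy as np

# Hypothetical X and Y values; the formulas are those of the text.
X = np.array([1., 2., 3., 4., 5., 6., 7.])
Y = np.array([2., 1., 3., 5., 4., 7., 6.])
n = len(X)

sum_xy = (X * Y).sum() - X.sum() * Y.sum() / n  # deviation cross products
sum_x2 = (X ** 2).sum() - X.sum() ** 2 / n      # deviation sum of squares of X

b = sum_xy / sum_x2           # Formula 33.2
a = Y.mean() - b * X.mean()   # Formula 33.3
Y_pred = a + b * X            # Formula 33.1: Y' = a + bX
d = Y - Y_pred                # residuals

print(round(b, 3), round(a, 3), round((d ** 2).sum(), 3))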
The final steps in the process, at least as far as it will be taken here, are to write regression equations and then, using the equations, calculate the predicted values of Y, or Y', given the X values. The two equations are given in the last line of Table 33.1. First look at the regression equation for r = .00: Y' = 3 + (0)X. This means, of course, that all the predicted Y's equal the mean of Y.
We can now calculate the predicted values of Y. The higher the correlation, the more accurate the prediction. The accuracy of the predictions of the two sets of scores can be clearly shown by calculating the differences between the original Y values and the predicted Y values, or Y − Y' = d, and then calculating the sums of squares of these differences. Such differences are called residuals. In Table 33.1, the two sets of residuals and their sums of squares have been calculated (see columns labeled d). The two values of Σd², 1.90 for A and 10.00 for B, are quite different, just as the plots in Figure 33.1 are quite different: that of the B, or r = .00, set is much greater than that of the A, or r = .90, set. That is, the higher the correlation, the smaller the deviations from prediction and thus the more accurate the prediction.
Earlier in the book we talked about the great need to assess the influence of several variables on a dependent variable. We can, of course, predict from verbal aptitude, say, to
reading achievement, or from conservatism to ethnic attitudes. But how much more powerful it would be if we could predict from verbal aptitude together with other variables known or thought to influence reading, for example, achievement motivation and attitude toward school work. Theoretically, there is no limit to the number of variables we can use, but there are practical limits. Although only two independent variables are used in the example that follows, the principles apply to any number of independent variables.
An Example
Take one of the problems just mentioned. Suppose we had reading achievement (RA), verbal aptitude (VA), and achievement motivation (AM) scores on 20 eighth-grade pupils. We want to predict to reading achievement, Y, from verbal aptitude, X₁, and achievement motivation, X₂. Or, we want to calculate the regression of reading achievement on both verbal aptitude and achievement motivation. If the scores on verbal aptitude and achievement motivation were standard scores, we might average them, treat the averages as one composite independent variable, and calculate the regression statistics as we did earlier. We might not do too badly either. But there is a better way.
Suppose the X₁, verbal aptitude, X₂, achievement motivation, and Y, reading achievement, scores of the 20 subjects and the sums, means, and raw score sums of squares are those of Table 33.2. (Disregard the Y' and d columns for the moment.) We need to calculate the deviation sums of squares, the deviation cross products, the standard deviations [√(Σx²/(N − 1))], and the correlations among the three variables. These are the basic statistics of the analysis.
The so-called normal equations express the relations between the correlations among the independent variables and the dependent variable and a set of weights called beta weights, βj, which will be explained later (they are like the b weights). The normal equations for the above problem are:

r₁₁β₁ + r₁₂β₂ = r_y1
r₂₁β₁ + r₂₂β₂ = r_y2     (33.5)

where βj = beta weights; r_ij = the correlations among the independent variables; and r_yj = the correlations between the independent variables and the dependent variable, Y. (Note that r₁₂ = r₂₁, and that r₁₁ = r₂₂ = 1.00. Note, too, that Equation 33.5 can be extended to any number of independent variables.)
Probably the best way, certainly the most elegant way, to solve the equations for the βj is to use matrix algebra. Unfortunately, knowledge of matrix algebra cannot be assumed. So the actual solution of the equations must be omitted. The solution yields the following beta weights: β₁ = .6123 and β₂ = .2357. The b weights are then obtained from the following formula:
b_j = β_j (s_y / s_j)     (33.6)

where s_j = standard deviations of variables 1 and 2 (see Table 33.3) and s_y = standard deviation of Y. Substituting in Equation 33.6 we obtain:

b₁ = (.6123)(2.9469/2.6651) = .6771
b₂ = (.2357)(2.9469/1.7652) = .3934
To obtain the intercept constant, extend Equation 33.3 to two independent variables:

a = Ȳ − b₁X̄₁ − b₂X̄₂
a = 5.50 − (.6771)(4.95) − (.3934)(5.20) = .1027

Finally, we write the complete regression equation:

Y' = .1027 + .6771X₁ + .3934X₂

The twenty predicted values calculated from this equation are given in the fourth column of Table 33.2. The fifth column of the table gives the deviations from regression, or the residuals, Yᵢ − Y'ᵢ = dᵢ.
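The whole solution can be sketched in a few lines of Python. The correlations r_y1 = .6736 and r_y2 = .3949 are assumptions, approximate values consistent with r₁₂ = .26 and the beta weights reported in the text; the standard deviations and means are those of Tables 33.2 and 33.3.

import numpy as np

# Solving the normal equations (33.5) for the two-predictor example with
# matrix algebra. r12 = .26 is given in the text; r_y1 and r_y2 are assumed
# values consistent with the reported betas.
r12, r_y1, r_y2 = .26, .6736, .3949

R_xx = np.array([[1.0, r12],
                 [r12, 1.0]])
r_xy = np.array([r_y1, r_y2])

betas = np.linalg.solve(R_xx, r_xy)      # beta1 ~ .6123, beta2 ~ .2357

s_y, s1, s2 = 2.9469, 2.6651, 1.7652     # standard deviations, Table 33.3
bs = betas * s_y / np.array([s1, s2])    # Formula 33.6: b_j = beta_j * s_y / s_j

means = np.array([4.95, 5.20])           # means of X1 and X2, Table 33.2
a = 5.50 - (bs * means).sum()            # extended Formula 33.3: a ~ .1027

print(np.round(betas, 4), np.round(bs, 4), round(a, 4))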
To compute the regression sum of squares, the regression of Y on X₁ and X₂ must be considered. Square each of the Y' values of the fourth column of Table 33.2 and sum:

(3.0305)² + · · · + (5.5649)² = 688.3969

Now use the usual formula for the deviation sum of squares (see Chapter 13):

ss_reg = 688.3969 − (110)²/20 = 83.3969

The residual sum of squares is the sum of the squares of the d's of the fifth column:

Σd² = (−1.0305)² + · · · + (4.4351)² = 81.6091¹⁰

As a check, calculate: 83.3969 + 81.6091 = 165.0060, which is, within errors of rounding, the total sum of squares of Y, 165.
The regression and residual sums of squares are not usually calculated in this way. They were so calculated here to show just what these quantities are. Had the formulas that are ordinarily used been used, we might not have clearly seen that the regression sum of squares is the sum of squares of the Y' values calculated by using the regression equation. We also might not have seen clearly that the residual sum of squares is the sum of squares calculated with the d's of the fifth column of Table 33.2. Recall, too, that the a and the b's (or β's) of the regression equation were calculated to satisfy the least-squares principle, that is, to minimize the d's, or errors of prediction, or, rather, to minimize the sum of the squares of the errors of prediction. To summarize: The regression sum of squares expresses that portion of the total sum of squares of Y that is due to the regression of Y, the dependent variable, on X₁ and X₂, the independent variables, and the residual sum of squares expresses that portion of the total sum of squares of Y that is not due to the regression.
The reader may wonder: Why bother with this complicated procedure of determining the regression weights? Is it necessary to invoke a least-squares procedure? Why not just average the X₁ and X₂ values and call the means of the individual X₁ and X₂ values the predicted Y's? The answer is that it might work quite well. Indeed, in this case it would work very well, almost as well, in fact, as the full regression procedure. But it might not work too well. The trouble is that you do not really know when it will work well and when it will not. The regression procedure always "works," other things equal. It always minimizes the squared errors of prediction. Notice that in both cases linear equations are used and that only the coefficients differ:

Y' = .1027 + .6771X₁ + .3934X₂
Y' = 0 + .5X₁ + .5X₂
Of the innumerable possible ways of weighting X₁ and X₂, which should be chosen if the least-squares principle is not used? It is conceivable, of course, that one has prior knowledge or some reason for weighting X₁ and X₂. X₁ may be the scores on some test that has been found to be highly successful in prediction. X₂ may be a successful predictor, too, but not as successful as X₁. Therefore one may decide to weight X₁ very heavily, say four times as much as X₂. The equation would be: Y' = 4X₁ + X₂. And this might
'"This is a the errors that cumulate through rounding. The actual regression sum of
"good" example of
squares, calculated by computer, 83.3909, an error of .006. Note, however, that even though the residuals
is
were calculated from the hand-calculated predicted Y's, the sum of squares of the residuals is exactly that
produced by the computer, 81.6091.
work well. The trouble is that seldom do we have prior knowledge, and even when we do, it is rather imprecise. How can the decision to weight X₁ four times as much as X₂ be reached? An educated guess can be made. The regression method is not a guess, however. It is a precise method based on the data and on a powerful mathematical principle.
R, the coefficient of multiple correlation, is the correlation between the observed Y values and the predicted Y' values. It is one of the links that bind together the various aspects of multiple regression and analysis of variance. The formula for R² that expresses the first sentence of this paragraph is:

R² = (Σyy')² / [(Σy²)(Σy'²)]     (33.10)
Using the Y and Y' values of Table 33.2, we obtain: R² = .5054 and R = √.5054 = .7109.¹¹
"Calculating these values is a good exercise. We already have Sv" = 165. Then calculate:
^^ ' = 83.3939
It can be shown algebraically that Sv''equals 2vv'. The difference of .003 is due to rounding errors.
R² can also be written in terms of the sums of squares:

R² = ss_reg / ss_t     (33.11)

where ss_t is, as usual, the total sum of squares of Y, or Σy². Substituting the regression
¹²Don't underestimate the importance of doing such calculations and pondering their meaning. This is especially important in helping to understand multiple regression and other multivariate techniques. It can be a serious mistake to let the computer do everything for us, especially with package programs. Use a programmable calculator-computer or a microcomputer. For the simpler statistics, like r and the various sums of squares, write relatively simple programs, store them on floppy disks of microcomputers or on the plastic slides of programmable calculators, and use them when needed. The nature and use of the computer, especially the microcomputer or the so-called personal computer, will be discussed in more detail in Appendix B.
sum of squares calculated earlier by Formula 33.7, and the total sum of squares from Table 33.3, we obtain:

R² = 83.3912 / 165.0000 = .5054

And R² is seen to be that part of the Y sum of squares associated with the regression of Y on the independent variables. As with all proportions, multiplying it by 100 converts it to a percentage.
Formula 33.11 provides another link to the analysis of variance. In Chapter 13 on the foundations of analysis of variance, a formula for calculating E, the so-called correlation ratio, was given (Formula 13.4). Square that formula:

E² = ss_b / ss_t

where ss_b = the between-groups sum of squares, and ss_t = total sum of squares. ss_b is the sum of squares due to the independent variable; ss_reg is the sum of squares due to regression. Both terms refer to the sum of squares of a dependent variable due to an independent variable or to independent variables.
The statistical significance of the overall regression can be tested, and questions can be asked about individual regression coefficients. In this chapter and the next, F tests will be used almost exclusively. They fit in nicely with both regression analysis and analysis of variance, and they are both conceptually and computationally simple.¹⁵
«c' = 1 - (1 - /?")
N- n
where R^r = shrunken or corrected R-; N = size of sample; n = total number of variables in the analysis. Using
this formula, the R- in the example reduces to .45.
"E.g.. G. Snedecor and W. Cochran, Statistical Methods. 6th ed. Ames, Iowa: Iowa State University Press,
1967, Table A 1 1, p. 557. A readily available and useful book of statistical tables is: R. Burington and D. May,
Handbook of Probability and Statistics with Tables. 2d ed. New York: McGraw-Hill, 1970. It should be pointed
out. however, that statistical tables are probably obsolescent because routines for calculating the p's (probabili-
ties) of r's. F. and t ratios, and other statistics can be and are written for computer programs, e.g.. N. Jaspen,
"The Calculation of Probabilities Corresponding to Values of r, t. F and Chi Square," Educational and Psycho-
logical Measurement. 25 (1965), 877-880.
¹⁵Consideration of t tests of regression coefficients must be omitted because they require matrix algebra calculations beyond our reach. A t test of a regression coefficient, if significant, indicates that the regression weight differs significantly from zero, which means that the variable with which it is associated contributes significantly to the regression, the other independent variables being taken into account.
The F ratio for the significance of the regression is:

F = (ss_reg / df₁) / (ss_res / df₂)     (33.12a)

F = (ss_reg / k) / (ss_res / (N − k − 1))     (33.12b)

where ss_reg = sum of squares due to regression; ss_res = residual or error sum of squares; k = number of independent variables; N = sample size. If df₁ and df₂, the degrees of freedom for the numerator and denominator of the F ratio in Equation 33.12a, are defined, Equation 33.12b results. It is important because it is a formula to test the significance of any multiple regression problem. Using the values calculated earlier for the example of Table 33.2, now calculate:

F = (83.3912/2) / (81.6091/(20 − 2 − 1)) = 41.6956/4.8005 = 8.686
The denominator, 4.8005, is the mean square of the deviations from regression, which is used as an error term, analogous to the within-groups mean square, or error variance. The basic principle, again, is always the same: variance due to the regression of Y on X₁, X₂, . . . , Xk, or, in analysis of variance, due to the experimental effects, is evaluated against variance presumably due to error or chance. This basic notion, elaborated at length in earlier chapters, can be expressed:
F = (R²/k) / [(1 − R²)/(N − k − 1)]     (33.13)

where k and N are the same as above. For the same example:

F = (.5054/2) / [(1 − .5054)/(20 − 2 − 1)] = .2527/.0291 = 8.684
which is the same as the F value obtained with Equation 33.12, within errors of rounding. At 2 and 17 degrees of freedom, it is significant at the .01 level. This formula is particularly useful when our research data are only in the form of correlation coefficients. In such a case, the sums of squares required by Equation 33.12 may not be known. Much regression analysis can be done using only the matrix of correlations among all the variables, independent and dependent. Such analysis is beyond the scope of this book. Nevertheless, the student of research should be aware of the possibility.
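A short Python sketch ties Formulas 33.12b and 33.13 together and, in the spirit of footnote 14, computes the exact probability instead of consulting a table; the shrunken R² of footnote 13 is added at the end. The numbers are those of the example.

from scipy.stats import f as f_dist

ss_reg, ss_res, k, N = 83.3912, 81.6091, 2, 20

# Formula 33.12b: F from the sums of squares
F = (ss_reg / k) / (ss_res / (N - k - 1))

# Formula 33.13: the same F from R^2 alone
R2 = ss_reg / (ss_reg + ss_res)
F_r = (R2 / k) / ((1 - R2) / (N - k - 1))

# Exact probability in place of an F table (cf. footnote 14)
p = f_dist.sf(F, k, N - k - 1)
print(round(F, 3), round(F_r, 3), round(p, 3))  # 8.686 8.686 0.003

# Shrunken (corrected) R^2 of footnote 13, with n = k + 1 variables in all
R2_c = 1 - (1 - R2) * (N - 1) / (N - (k + 1))
print(round(R2_c, 2))  # 0.45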
Interpreting the statistics of multiple regression analysis is considerably more difficult than interpreting the univariate statistics studied earlier. We therefore go into the interpretation of the statistics of our example in some depth.
The F ratio of 8.684 calculated above tells us that the regression of Y on X₁ and X₂, expressed by R_y.12, is statistically significant. The probability that an F ratio this large will occur by chance is less than .01 (it is actually about .003), which means that the relation between Y and a least-squares combination of X₁ and X₂ could probably not have occurred by chance.
R = .71 can be interpreted much like an ordinary coefficient of correlation, except that the values of R range from 0 to 1.00, unlike r, which ranges from −1.00 through 0 to 1.00. R² = .71² = .51 is more meaningful and useful, however. It means that 51 percent of the variance of Y is accounted for, or "determined," by X₁ and X₂ in combination. It is accordingly called a coefficient of determination.
Let us ask, somewhat diffidently, a more difficult question: What are the relative contributions of X₁ and X₂, of verbal aptitude and achievement motivation, to Y, reading achievement? The restricted scope of this book does not permit an examination of the answers to this question in the detail it deserves.¹⁷
One would think that the regression weights, b or β, would provide us with a ready answer. We can say that X₁, verbal aptitude, is weighted more heavily than X₂, achievement motivation. This happens to be true in this case, but it may not always be true, especially with more independent variables.
Regression coefficients, unfortunately for interpretative purposes, are not stable. They change with different samples and with the addition or subtraction of independent variables in the analysis.¹⁸ There is no absolute way to interpret them. If the correlations among the
"The problem of the relative contribution of independent variables to a dependent variable or variables is
one of the most complex and difficult of regression analysis. It seems that no really satisfactory solution exists,
at least when the independent variables are correlated. Nevertheless, the problem cannot be neglected. The
reader should bear in mind, however, that considerable reservation must be attached to the above and later
discussions. The technical and substantive problems of interpretation of multiple regression analysis are dis-
cussed in two or three of the references given in Study Suggestion at the end of the next chapter.
1
¹⁸See R. Darlington, "Multiple Regression in Psychological Research and Practice," Psychological Bulletin, 69 (1968), 161-182; R. Gordon, "Issues in Multiple Regression," American Journal of Sociology, 73
independent variables are all zero or near-zero, interpretation is greatly simplified. But many or most variables that are correlated with a dependent variable are also correlated among themselves. The example of Table 33.3 shows this: the correlation between X₁ and X₂ is .26, a modest correlation, to be sure. Such intercorrelations are often higher, however. And the higher they are, the more unstable the interpretation situation.
The ideal predictive situation is when the correlations between the independent variables and the dependent variable are high, and the correlations among the independent variables are low. This principle is important. The more the independent variables are intercorrelated, the more difficult the interpretation. Among other things, one has greater difficulty telling the relative influence of the independent variables on the dependent variable. Examine the two fictitious correlation matrices of Table 33.4 and the accompanying R²'s. In the two matrices, the independent variables, X₁ and X₂, are correlated .87 and .43, respectively, with the dependent variable, Y. But the correlations between the independent variables are different in the two cases. In matrix A, r₁₂ = .50, a substantial correlation. In matrix B, however, r₁₂ = 0.
The contrast between the R²'s is dramatic: .76 for A and .94 for B. Since, in B, X₁ and X₂ are not correlated, any correlations they have with Y contribute fully to the prediction and the R².¹⁹ When the independent variables are correlated, as in matrix A (r₁₂ = .50), some of the common variance of Y and X₁ is also shared with X₂. In short, X₁ and X₂ are to some extent redundant in predicting Y. In matrix B there is no such redundancy.
The situation is clarified, perhaps, by Figure 33.2. Let the circles stand for the total variance of Y, and let this total variance be 1.00. Then the portions of the variance of Y accounted for by X₁ and X₂ can be depicted. In both circles, the horizontal hatching indicates the variance accounted for by X₁, or V_x1, and the vertical hatching that accounted for by X₂, or V_x2. (The variances remaining after V_x1 and V_x2 are the residual variances, labeled V_res in the figure.) In B, V_x1 and V_x2 do not overlap. In A, however, V_x1 and V_x2 overlap. Simply because r₁₂ = 0 in B and r₁₂ = .50 in A, the predictive power of the independent variables is much greater in B than in A. This is, of course, reflected by the R²'s: .76 in A and .94 in B.
While this is a contrived and artificial example, it has the virtue of showing the effect of correlation between the independent variables, and thus it illustrates the principle enunciated above. It also reflects the difficulty of interpreting the results of most regression analysis, since in much research the independent variables are correlated. And when more independent variables are added, interpretation becomes still more complex and difficult. A central problem is: How does one sort out the relative effects of the different X's on Y? The answer is also complex. There are ways of doing so, some more satisfying than others, but none completely satisfactory.²⁰

Figure 33.2  A: R²_y.12 = .76 (V_res = .24; .76 + 0 = .76). B: R²_y.12 = .94 (.76 + .18 = .94).
^"Perhaps the most satisfactory way, at least in the author's opinion and experience, is to calculate so-called
squared semipartial correlations (also called part correlations). These are calculated with the formula;
which indicates the contribution to the variance of after X, has been taken into account. The same
Y of X2
calculation for A yields: .76 - .76 = 0. which indicates that X2 contributes nothing to the variance of Y. after X,
has been taken into account. (Actually, there is a slight increase that emerges only with a large number of
decimal places.)
The student is referred to the articles by Darlington and Gordon cited earlier for discussions of the problems
involved. Kerlinger and Pedhazur also discuss the problem in considerable detail and relate it to research
A beta weight is a standard partial regression weight: it expresses the change in Y with a change in the X to which the weight applies, when the other variables are held constant. For example, β₁.23, or β₁ in a three-variable (independent variable) problem, is the standard partial regression weight, which expresses the change in Y due to change in X₁, with variables 2 and 3 held constant.²¹ The b weights, too, are partial regression coefficients, but they are not in standard form.
Another problem is that in any given regression, R, R^, and the regression weights will
be the same no matter what the order of the variables. If one or more variables are added
or subtracted from the regression, however, these values will change. And regression
weights can change from sample to sample. In other words, there is no absolute quality
about them. One cannot say, for instance, that because verbal and numerical aptitudes
have, say, regression weights of .60 and .50 in one set of data, they will have the same
values in another set.
Another important point is that there usually is limited usefulness in adding new variables to a regression equation. Because many variables of behavioral research are correlated, the principle illustrated by the data of Table 33.4 and discussed earlier operates so as to decrease the usefulness of additional variables. If one finds three or four independent variables that are substantially correlated with a dependent variable and not highly correlated with each other, one is lucky. But it becomes more and more difficult to find other independent variables that are not in effect redundant with the first three or four. If R²_y.123 = .50, then it is unlikely that R²_y.1234 will be much more than .55, and R²_y.12345 will probably not be more than .56 or .57. We have a regression law of diminishing returns,²² which will be illustrated in the next section when actual research results are discussed.
It was said above that R, R², and the regression coefficients remain the same if the same variables are entered in different orders. This should not be taken to mean, however, that the order in which variables enter the regression equation does not matter. On the contrary, order of entry can be very important. When the independent variables are correlated, the relative amount of variance of the dependent variable that each independent variable accounts for or contributes can change drastically with different orders of entry of the variables. With the A data of Table 33.4, for example, if we reverse the order of X₁ and X₂, their relative contributions change rather markedly. With the original order, X₂ contributed nothing to R², whereas with the order reversed X₂ becomes X₁ and contributes 19 percent [r² = (.43)² = .19] to the total R², and the original X₁, which becomes X₂, contributes 57 percent (.19 + .57 = .76). The order of variables, while making no difference in the final R² and thus in overall prediction, is a major research problem.
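Order-of-entry effects can be explored with a few lines of matrix algebra, since R² is computable from a correlation matrix alone. The function below is a minimal sketch under that assumption, not a package routine; it reproduces the matrix A figures within rounding.

import numpy as np

def r_squared(corr, y, xs):
    # R^2 of variable y on predictors xs, computed from a correlation matrix
    R_xx = corr[np.ix_(xs, xs)]
    r_xy = corr[np.ix_(xs, [y])].ravel()
    return float(r_xy @ np.linalg.solve(R_xx, r_xy))

# Matrix A of Table 33.4; index 0 is Y, 1 is X1, 2 is X2
corr = np.array([[1.00, 0.87, 0.43],
                 [0.87, 1.00, 0.50],
                 [0.43, 0.50, 1.00]])

for order in ([1, 2], [2, 1]):
    prev, parts = 0.0, []
    for i in range(1, len(order) + 1):
        r2 = r_squared(corr, 0, order[:i])
        parts.append(round(r2 - prev, 2))  # squared semipartial increments
        prev = r2
    print(order, parts)
# [1, 2] -> [0.76, 0.0]; [2, 1] -> [0.18, 0.57] (the text rounds .43**2 to .19)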
²¹A second meaning, used in theoretical work, is that β is the population regression weight which b estimates. We omit this meaning. β's can be translated into b's with the formula:

b_j = β_j (s_y / s_j)

where s_y = standard deviation of Y and s_j = standard deviation of variable j.
²²When independent variables are added, one notes how much they add to R² and tests their statistical significance. The formula for doing so, much like Formula 33.13, is:

F = [(R²₁ − R²₂)/(k₁ − k₂)] / [(1 − R²₁)/(N − k₁ − 1)]

where k₁ = number of independent variables of the larger R², k₂ = number of independent variables of the smaller R², and N = number of cases. This formula will be used later. Although an F calculated like this may be statistically significant, especially with a large sample, the actual increase in R² may be quite small. In an example presented later in the chapter (Layton and Swanson's study), the addition of a sixth independent variable yielded a statistically significant F ratio, but the actual increase in R² was .0147! The difference between the R²'s in the numerator is the squared semipartial correlation coefficient mentioned in footnote 20.
RESEARCH EXAMPLES

We take the data of a relatively simple study to show in some detail the importance of the order of entry of independent variables. Layton and Swanson used the subtests of the
Figure 33.3  Reproduction rates of bald eagles by year, before (o) and after (x) the DDT ban (see footnote 23).
²³J. Grier, "Ban of DDT and Subsequent Recovery of Reproduction in Bald Eagles," Science, 218 (1982), 1232-1234.
²⁴The method of comparing slopes statistically is given in Pedhazur, op. cit., pp. 438ff. The simple regressions are calculated using years as the independent variable and reproduction rates as the dependent variable. The correlation before the DDT ban was −.74, but after the ban it was .80 (my calculations).
²⁵The regression before the ban was calculated through 1974 because the effect of the ban could not have been expected to manifest itself for about a year. Grier did his calculation through 1973 (Grier, op. cit., footnote 11).
Differential Aptitude Test to predict rank in high school.²⁶ Using the correlation matrix that Layton and Swanson calculated among the six DAT subtests and high school rank (628 boys in 27 schools), three different orders of entry were used in multiple regression analyses.
To simplify matters, we report only the results with the first four of the six DAT subtests: Verbal Reasoning (VR), Numerical Aptitude (NA), Abstract Reasoning (AR), and Space Relations (SR). And we report only the R²'s of the first variable entered and the differences between the successive R²'s, that is, R²_y.1; R²_y.12 − R²_y.1; R²_y.123 − R²_y.12; and R²_y.1234 − R²_y.123. These differences show the contributions, respectively, of X₁ alone, of X₂ after subtracting the effect of X₁, of X₃ after subtracting the effect of X₁ and X₂, and, finally, of X₄ after subtracting the effect of X₁, X₂, and X₃. These indices are squared semipartial (part) correlations (see footnotes 20 and 22). They can be interpreted, with circumspection, as indices of the variance contributions of each of the variables, in that particular order.
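The incremental F test of footnote 22 is easily scripted. In the sketch below the function is hypothetical shorthand for that formula, and the R² values are invented to mimic the Layton-Swanson situation, in which a sixth variable added only .0147 yet was statistically significant (N = 628).

from scipy.stats import f as f_dist

def f_increment(R2_big, R2_small, k1, k2, N):
    # F test for the increase in R^2 when predictors are added (footnote 22)
    return ((R2_big - R2_small) / (k1 - k2)) / ((1 - R2_big) / (N - k1 - 1))

# Invented R^2 values that mimic the Layton-Swanson situation: a sixth
# predictor adds only .0147, yet with N = 628 the F ratio is significant.
F = f_increment(R2_big=.5300, R2_small=.5153, k1=6, k2=5, N=628)
p = f_dist.sf(F, 6 - 5, 628 - 6 - 1)
print(round(F, 2), f"{p:.1e}")  # large F, tiny p, trivial practical gain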
The squared semipartial correlations, which are percentages of the total variance (with the particular order of entry of the variables), are given in Table 33.5. The differences are pronounced. VR, for instance, which accounts for 31 percent of the total variance (with all six independent variables) in the first order of entry, accounts for only 5 percent in the third order. AR, which accounts for almost none of the variance in the first order when it is the third independent variable, jumps to 20 percent in the second order when it is the first independent variable. Obviously, the order of entry of variables in the regression equation is highly important. The reader should note other similar differences, for example, VR.
If readers feel a bit baffled by the problem of the order of entry of variables, they can
hardly be blamed. Indeed, they have company among experts in the field. Actually, there
is no "correct"" method for determining the order of variables. A researcher may decide
that he will let the computer choose the variables in order of the size of their contributions
to the variance of Y. For some problems this may be satisfactory; for others, it may not.
As always, there is no substitute for depth of knowledge of the research problem and
concomitant knowledge and use of the theory behind the problem. In other words, the
research problem and the theory behind the problem should determine the order of entry
of variables in multiple regression analysis.
In one problem, for instance, intelligence may be a variable that is conceived as acting in concert with other variables, compensatory methods and social class, say, to produce changes in verbal achievement. Intelligence would then probably enter the equation after compensatory methods and before (or after) social class. A researcher doing this would be
influenced by the notion of interaction: the compensatory methods differ in their effects at
different levels of intelligence. Suppose, however, that the researcher wants only to con-
trol intelligence, to eliminate its influence on verbal achievement. The theory underlying
his reasoning may say nothing about an interaction between intelligence and other varia-
bles. But the researcher knows that it will certainly influence verbal achievement and he
wants its influence eliminated before the effects of compensatory education and social
class are assessed. In this case he would treat intelligence as a covariate and enter it into
the regression equation first.
Earlier in this book it was said: "Design is data discipline." The design of research
and the analysis of data spring from the demands of research problems. Again, the order
of entry of independent variables into the regression equation is determined by the re-
search problem and the design of the research, which is itself determined by the research
problem.
Although the order of entry of variables and the changes in regression weights that can
occur with different samples are difficult problems, one must remember that regression
weights do not change with different orders of entry. This is a real compensation, espe-
cially useful in prediction. In many research problems, for example, the relative contribu-
tion of variables is not a major consideration. In such cases, one wants the total regression
equation and its regression weights mainly for prediction and for assessing the general
nature of the regression situation.
As an illustration of an entirely different kind of research and data and of the effect of high correlations among independent variables, and thus high redundancy of predictors, consider a study of the regression of political development (of 77 nations), Y, on communication, X₁, urbanization, X₂, education, X₃, and agriculture, X₄.²⁷ The lowest r between independent variables was .69 and the highest .88. There is obviously considerable redundancy. This is clearly shown by calculating R²_y.1, R²_y.12, R²_y.123, and R²_y.1234. They are: .66, .67, .67, .67! Efficiency of prediction is as good with one independent variable, X₁, as it is with all four independent variables! This state of high correlations among independent variables sharply limits what additional variables can contribute to prediction.
We now examine two highly important studies of equality of educational opportunity, one in the United States and the other in Poland.²⁸ We study the first of these, the justly famous Coleman Report, in some depth because it is interesting, instructive, and highly controversial, and because it is perhaps the most important single and massive educational study of three decades.
One of the basic purposes of the study was to explain school achievement, or rather,
inequality in school achievement. Multiple regression analysis was used in a complex
manner to help do this. The researchers chose as one of their most important dependent
²⁷P. Cutright, "National Political Development: Measurement and Analysis," American Sociological Review, 27 (1963), 229-245. Although Cutright supplied regression statistics, I calculated the R²'s reported above. The student can profit from studying Cutright's solutions to interesting measurement problems.
²⁸J. Coleman et al., Equality of Educational Opportunity. Washington, D.C.: Dept. of Health, Education, and Welfare, Office of Education (U.S. Govt. Printing Office), 1966; A. Firkowska et al., "Cognitive Development and Social Policy," Science, 200 (1978), 1357-1362.
variable measures verbal ability or skill (VA). Some 60 independent variable measures were correlated with VA. From the many correlations reported by the authors, several were selected from those for the total Northern white and black samples (in Appendix 9.10), and multiple regression analyses were done.
Five independent variables were chosen to predict to VA because of their presumed importance. They are listed in the footnote of Table 33.6. The R²'s, beta weights, β, and the squared semipartial correlations (SP²) of the regression analysis of two samples of more than 100,000 each of Northern white and Northern black twelfth-grade pupils are given in Table 33.6. In addition to the comparisons between the white and black sample results, the variables have been entered into the regression equation in three different orders of entry. The R²'s and the β's for the three orders, of course, are the same, since changing the order does not change R² and the β's, as we learned earlier.
Most of the variance of verbal ability appears to be due to Self-Concept, a measure constructed from three questions, the answers to which reveal how the pupil perceives himself (e.g., "I sometimes feel that I just can't learn"). Study the SP²'s and see that this is true in all orders of entry for whites (.214, .139, .218). It is less true for blacks (.111, .063, .120). The only other variable that accounts for a substantial amount of variance (≥ .10) is Control of Environment, CE, which is another variable involving the concept of self and adding the notion of control over one's fate. Here whites and blacks are similar, except that CE appears to be somewhat weightier for blacks.
One of the most interesting comparisons is that between kinds of variables. SC and CE are both "subjective" variables; the pupil projects his own image. The other variables are

Table 33.6  Multiple Regression Analysis: R²'s, Beta Weights, and Squared Semipartial Correlations, Selected Variables from Equality of Educational Opportunity
"objective": they are external to the pupil; they are part of the objective environment, so
to speak. Thiswas an important finding of the study. Where things like tracking (homoge-
neous grouping) and school facilities accounted for little of the variance in achievement,
the so-called attitude variables, two of which were SC and CE, accounted for more
"''
variance for both white and black pupils than any other variables in the study.
The Warsaw Study is highly important not only because it is a good example of
multiple regression analysis but also because it was an attempt to assess the effects of a
massive effort by a government to achieve educational equality, among other things. After
the World War II destruction of Warsaw, the government attempted to equalize social
conditions and educational opportunity by spreading people over different parts of the
city. One of the objectives of the policy was to reduce differences in ability and achieve-
ment due to social class. In other words, if equalization of extrinsic factors of the environ-
ment was accomplished, this should help to wipe out differences in mental performance:
the correlation between social class and intelligence should be zero. The researchers used
two kinds of variables: intrinsic and extrinsic. The latter were presumably equalized by
the policy. Intrinsic variables were parental occupation, parental education, and others.
All children born in 1963 and then (1974) living in Warsaw (about 13,000) were tested
using a well-known nonverbal intelligence test.
Among the many analyses done, we are here concerned only with two of them. In the multiple regression analysis, the researchers created and analyzed composite district, school, and family variables. The first two were those presumably affected by the equalization policy. It was believed that the effects of the variables the children bring with them to school, the family set, would also be equalized. The researchers entered the sets of variables in three different orders into multiple regression analyses. Summary R² results are given in Table 33.7. The data in the table are essentially squared semipartial correlations (see footnotes 20 and 22), which are of course affected by the order of entry of the variables in the regression equation. The total R², R²_y.123, was .106. The important coefficients in testing the equalization effects are those of the second data column, the SP²'s.
[Table 33.7, giving the summary R² results for each order of entry of the district, school, and family variable sets, is not reproduced here.]
The district and school variable sets, no matter the order of entry, amounted to only about .02 (.016 + .006 = .022). As the authors say (p. 1361), the contribution of extrinsic variables was minor. The program was apparently successful in equalizing the extrinsic school and district variables. But the intrinsic family variables, again no matter the order of entry, accounted for most of the total R², between .08 and .10. Alas, the heroic attempt to annihilate the influence of what the child brings with him failed. As the authors say a bit wistfully,

Despite this social policy of equalization, the association persists in a form characteristic of more traditional societies. Indeed, contrary to expectation, those associations are as strong as many reported in large-scale studies from Western societies . . . societal changes over a generation have failed to override forces that determine the social class distribution of mental performance.
. . . where p is an independent variable, q a dependent variable, and r, s, and t other independent variables. Other kinds of statements can, of course, be written, e.g., If p and r, then q. In such a case p and r are two independent variables, both of which are required for q.
The point of all this is that multiple regression can successfully handle such cases. In most behavioral research there is usually one dependent variable, though we are theoretically not restricted to only one. Consequently, multiple regression is a general method of
analyzing much behavioral research data. Certain other methods of analysis can be con-
sidered special cases of multiple regression. The most prominent is analysis of variance,
all types of which can be conceptualized and accomplished with multiple regression
analysis.
It was said earlier that all control is control of variance. Multiple regression analysis
²⁰Ibid., p. 1362. Note that the multiple regression analysis was supported by another analysis. The authors worked out a global index of parental education and occupation, which ran from 0 through 2 (see their Table 3 and accompanying discussion). They calculated the mean scores of about 13,000 children on the intelligence test used for each level of this "home background" index (see their Table 6). The correlation between the index values and mean intelligence test scores was .98 (my calculation)!
-"
Some of the material in this section was published in my essay, "Research in Education." In R. Ebel. V.
Noll, and R. Bauer, eds.. Encyclopedia of Educational Research. 4th ed. New York: Macmillan, 1969, pp.
1127-1 144. Note that whenever the expression "If p, then q" appears, it should be taken tomean "If p, then
probably q.
plishes this the same way analysis of variance does: by estimating the magnitudes of different sources of influence on Y, different sources of variance of Y, through analysis of the interrelations of all the variables. It tells how much of Y is presumably due to X₁, X₂, . . . , Xₖ. It gives some idea of the relative amounts of influence of the X's. And it furnishes tests of the statistical significance of combined influences of X's on Y and of the separate influence of each X. In short, multiple regression analysis is an efficient and powerful hypothesis-testing and inference-making technique, since it helps scientists study, with relative precision, complex interrelations between independent variables and a dependent variable, and thus helps them "explain" the presumed phenomenon represented by the dependent variable.²³
²³Study suggestions for this chapter are given at the end of the next chapter.
Chapter 34

Multiple Regression, Analysis of Variance, and Other Multivariate Methods
Close examination shows the conceptual bases underlying different approaches to data analysis to be the same or similar. The symmetry of the fundamental ideas has great aesthetic appeal, and is nowhere more interesting and appealing than in multiple regression and analysis of variance. Earlier, in discussing the foundations of analysis of variance, the similarity of the principles and structures of analysis of variance and so-called correlational methods was brought out. We now link the two approaches and, in the process, show that analysis of variance can be done using multiple regression. In addition, the linking of the two approaches will happily yield unexpected bonuses. We will see, for example, that certain analytic problems that are intractable with analysis of variance, or at least difficult and certainly inappropriate, are quite easily conceptualized and accomplished by the judicious and flexible use of multiple regression. Because of space limitation, and because the book's purpose is not to teach the mechanics of statistical methods and approaches, the discussion will be quite limited: some of what is said must be taken on faith. Nevertheless, even at a somewhat limited level of discourse we will find that certain difficult problems associated with analysis of variance (analysis of covariance, pretest and posttest data, unequal numbers of cases in cells of factorial designs, and the handling of both experimental and nonexperimental data) are naturally and easily handled with multiple regression analysis.
significant at the .01 level. The effect of the experimental treatment is clearly significant. η² = ss_b/ss_t = 90/120 = .75. The relation between the experimental treatment and comprehension is strong.
Now, transfer our thinking from an analysis of variance framework to a multiple regression framework. Can we obtain η² = .75 "directly"? The independent variable, methods, can be conceived as membership in the three experimental groups, A₁, A₂, and A₃. This membership can be expressed by 1's and 0's: if a subject is a member of A₁, . . .
[Table 34.1 and the pages of calculations that follow, including the dummy-coded data of Table 34.2, are not reproduced here.]
F = (R²/k) / [(1 − R²)/(N − k − 1)] = (.75/2) / [(1 − .75)/(15 − 2 − 1)] = .375000/.020833 = 18

F = (ss_reg/df₁) / (ss_res/df₂) = (ss_reg/k) / (ss_res/(N − k − 1)) = (90/2) / (30/(15 − 2 − 1)) = 45.00/2.50 = 18
This F ratio is then checked in an F table (see Kerlinger and Pedhazur, App. D), at df = 2, 12. The entry at p = .05 is 3.88 and at p = .01 it is 6.93. Since the F of 18 calculated above is greater than 6.93, the regression is statistically significant, and R² is statistically significant.
Check the multiple regression values calculated with those obtained earlier from the analysis of variance. The values of the sums of squares, the mean squares, and F are the same. R² also equals η². In addition, the values of a and the b's tell us something about the data. a = 3 is the mean of A₃, the group assigned zeroes in both coded vectors. b₁ = 3 is the difference between the means of A₁ and A₃: 6 − 3 = 3. b₂ is the difference between the means of A₂ and A₃: 9 − 3 = 6. The means of the three groups are easily found by using the regression equation:

Y' = a + b₁X₁ + b₂X₂
Y'(A₁) = 3 + (3)(1) + (6)(0) = 6
Y'(A₂) = 3 + (3)(0) + (6)(1) = 9
Y'(A₃) = 3 + (3)(0) + (6)(0) = 3
F = (ss_b/df₁) / (ss_w/df₂) = (ss_reg/df₁) / (ss_res/df₂)
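The equivalence is easy to verify numerically. Here is a minimal sketch in Python with NumPy (a modern illustration, not from the original text). The individual scores of Table 34.2 are not reproduced in this excerpt, so the raw data below are stand-ins chosen to match the summary statistics given above (group means 6, 9, and 3; ss_t = 120; ss_reg = 90):

    import numpy as np

    # Stand-in raw scores consistent with the text's summary statistics
    y = np.array([6, 5, 4, 8, 7,     # A1 (mean 6)
                  11, 8, 10, 9, 7,   # A2 (mean 9)
                  4, 1, 3, 5, 2],    # A3 (mean 3)
                 dtype=float)
    x1 = np.array([1]*5 + [0]*5 + [0]*5)   # 1 = member of A1
    x2 = np.array([0]*5 + [1]*5 + [0]*5)   # 1 = member of A2; A3 is coded (0, 0)

    X = np.column_stack([np.ones(15), x1, x2])   # intercept plus coded vectors
    b = np.linalg.lstsq(X, y, rcond=None)[0]     # a, b1, b2

    ss_t = np.sum((y - y.mean())**2)             # total sum of squares = 120
    ss_res = np.sum((y - X @ b)**2)              # residual (within groups) = 30
    R2 = 1 - ss_res / ss_t                       # .75, equal to eta-squared
    F = (R2 / 2) / ((1 - R2) / (15 - 2 - 1))     # 18, as in the text

    print(np.round(b, 2), R2, F)                 # [3. 3. 6.] 0.75 18.0

The regression fit reproduces the group means exactly, which is why R² here equals η² from the analysis of variance.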
While it has been shown that multiple regression analysis accomplishes what one-way analysis of variance does, can it be said that there is any real advantage to using the regression method? Actually the calculations are more involved. Why do it, then? The answer is that with the kinds of data of the example above there is no practical advantage beyond aesthetic nicety and conceptual clarification. But when research problems are more complex, when, for example, interactions, covariates (e.g., intelligence test scores), nominal variables (sex, social class), and nonlinear components (X², X³) are involved, the regression procedure has decided advantages. Indeed, many research analytic problems that analysis of variance cannot handle readily or at all can be fairly readily accomplished with multiple regression analysis. Factorial analysis of variance, analysis of covariance, and, indeed, all forms of analysis of variance can also be done with regression analysis. Since it is not our purpose to teach statistics and the mechanics of analysis, we refer the reader to appropriate discussions like those cited in footnote 3. We will explain in the next section, however, the nature of the highly important methods of coding variables and their use in analysis.
Table 34.5 [dummy, effects, and orthogonal coding of the three experimental groups, two subjects per group; entries not reproduced]

. . . dummy coding of Table 34.5, using only two subjects per experimental group. Since there are two degrees of freedom, or k − 1 = 3 − 1 = 2, there are two column vectors, labeled X₁ and X₂. The dummy coding assignment has already been explained: a 1 indicates that a subject is a member of the experimental group against which the 1 is placed, and a 0 that he is not. . . .
One advantage of effects coding, on the other hand, is that the intercept constant, a, yielded by the multiple regression analysis will equal the grand mean of Y. For the data of Table 34.2, the intercept constant is 6.00, which is the mean of all the Y scores.

The third form of coding is orthogonal coding. (It is also called "contrasts" coding, but some contrasts coding can be nonorthogonal.) As its name indicates, the coded vectors are orthogonal or uncorrelated. If an investigator's main interest is in specific contrasts between means rather than the overall F test, orthogonal coding can provide the needed contrasts. In any set of data, a number of contrasts can be made. This is, of course, particularly useful in analysis of variance. The rule is that only contrasts that are orthogonal to one another may be used. One such contrast is A₁ against the average of A₂ and A₃, or M(A₁) − (M(A₂) + M(A₃))/2. X₁ is then coded (0, −1, 1) and X₂ is coded (2, −1, −1), as shown by the orthogonal coding of Table 34.5. The interested reader grounded in analysis of variance can follow up such possibilities.⁵
No matter what kind of coding is used, R², F, the sums of squares, the standard errors of estimate, and the predicted Y's will be the same (the means of the experimental groups). The intercept constant, the regression weights, and the t tests of the b weights will be different. Strictly speaking, it is not possible to recommend one method over another; each has its purposes. At first, it is probably wise for the student to use the simplest method, dummy coding, or 1's and 0's. He should fairly soon use effects coding, however. Finally, orthogonal coding can be tried and mastered.⁶
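A minimal Python sketch comparing the three codings, using the same stand-in scores as above; the contrast vectors follow the (0, −1, 1) and (2, −1, −1) pattern just described:

    import numpy as np

    y = np.array([6, 5, 4, 8, 7, 11, 8, 10, 9, 7, 4, 1, 3, 5, 2], dtype=float)

    # Rows of (X1, X2) assigned to members of A1, A2, and A3 under each coding
    codings = {
        "dummy":      ([1, 0], [0, 1], [0, 0]),
        "effects":    ([1, 0], [0, 1], [-1, -1]),
        "orthogonal": ([0, 2], [-1, -1], [1, -1]),   # X1 = (0,-1,1), X2 = (2,-1,-1)
    }
    for name, (a1, a2, a3) in codings.items():
        X = np.column_stack([np.ones(15), np.array([a1]*5 + [a2]*5 + [a3]*5)])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        R2 = 1 - np.sum((y - X @ b)**2) / np.sum((y - y.mean())**2)
        print(f"{name:10s} a = {b[0]:5.2f}  b = {np.round(b[1:], 2)}  R2 = {R2:.2f}")

    # R2 is .75 in every case. Dummy coding gives a = 3 (the mean of A3);
    # effects coding gives a = 6 (the grand mean); the b's differ by coding.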
The simplest use of coding is to indicate nominal variables, particularly dichotomies. Some variables are "natural" dichotomies: sex, public school-parochial school, conviction-no conviction, vote for-vote against. All these can be scored (1, 0) and the resulting vectors analyzed as though they were continuous score vectors. Most variables are continuous, or potentially so, however, even though they can always be treated as dichotomous. In any case, the use of (1, 0) vectors for dichotomous variables in multiple regression is highly useful.
⁵Cohen, op. cit., pp. 428-434, especially pp. 432-434. For a detailed discussion, see Kerlinger and Pedhazur, op. cit., chap. 7.
⁶Before using orthogonal coding to any extent, the student should study the topic of comparisons of means. See W. Hays, Statistics, 3d ed. New York: Holt, Rinehart and Winston, 1981, chap. 12.
With nominal variables that are not dichotomies one can still use (1, 0) vectors. One simply creates a (1, 0) vector for each subset but one of a category or partition. Suppose the category A is partitioned into A₁, A₂, A₃, say Protestant, Catholic, Jew. Then a vector is created for Protestants, each of whom is assigned a 1; the Catholics and Jews are assigned 0. Another vector is created for Catholics: each Catholic is assigned 1; Protestants and Jews are assigned 0. It would, of course, be redundant to create a third vector for Jews. The number of vectors is k − 1, where k = the number of subsets of the partition or category.
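A short sketch of this construction, with fictitious labels (P, C, J are arbitrary stand-ins):

    import numpy as np

    religion = np.array(["P", "C", "J", "P", "C", "J", "P", "J"])  # fictitious labels
    x1 = (religion == "P").astype(int)   # 1 = Protestant, 0 = otherwise
    x2 = (religion == "C").astype(int)   # 1 = Catholic, 0 = otherwise
    # No third vector is needed: Jews are identified by (0, 0),
    # so k - 1 = 2 vectors carry the whole partition.
    print(np.column_stack([x1, x2]))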
While sometimes convenient or necessary, partitioning a continuous variable into a dichotomy or trichotomy throws information away. If, for example, an investigator dichotomizes intelligence, ethnocentrism, cohesiveness of groups, or any other variable that can be measured with a scale that even approximates equality of interval, he is discarding information. To reduce a set of values with a relatively wide range to a dichotomy is to reduce its variance and thus its possible correlation with other variables. A good rule of research data analysis, therefore, is: Do not reduce continuous variables to partitioned variables (dichotomies, trichotomies, etc.) unless compelled to do so by circumstances or the nature of the data (seriously skewed, bimodal, etc.).
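The attenuation is easy to demonstrate with simulated scores; the sample size and coefficients below are arbitrary choices for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=5000)                      # a continuous variable
    y = 0.6 * x + 0.8 * rng.normal(size=5000)      # a correlated criterion, r about .6
    x_split = (x > np.median(x)).astype(float)     # reduce x to a median-split dichotomy

    print(round(np.corrcoef(x, y)[0, 1], 2))       # about .60
    print(round(np.corrcoef(x_split, y)[0, 1], 2)) # noticeably smaller, about .48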
It is with factorial analysis of variance, analysis of covariance, and nominal variables that we begin to appreciate the advantages of multiple regression analysis. We do little more here than comment on the use of coded vectors in factorial analysis of variance. Exceptionally full discussions can be found in Pedhazur's exhaustive work.⁸ We will, however, explain the basic reason why multiple regression analysis is often better than factorial analysis of variance.
The underlying difficulty in research and analysis is that the independent variables in which we are interested are correlated. Analysis of variance, however, assumes that they are uncorrelated. If we have, say, two experimental independent variables and subjects are assigned at random to the cells of a factorial design, we can assume that the two independent variables are not correlated, by definition, and factorial analysis of variance is appropriate. But if we have two nonexperimental variables and the two experimental variables, we cannot assume that all four independent variables are uncorrelated. Although there are ways to analyze such data with analysis of variance, they are cumbersome and "unnatural." Moreover, if there are unequal n's in the groups, analysis of variance becomes still more inappropriate, because unequal n's also introduce correlations between independent variables. The analytic procedure of multiple regression, on the other hand, takes cognizance, so to speak, of the correlations among the independent variables as well as between the independent variables and the dependent variable. This means that multiple regression can effectively analyze both experimental and nonexperimental data, separately or together. Moreover, continuous and categorical variables can be used together.
When subjects have been assigned at random to the cells of a factorial design and other things are equal, there isn't much benefit from using multiple regression. But when the n's of the cells are unequal, and one wants to include one, two, or more control variables like intelligence, sex, and social class, then multiple regression should be used. This point is most important. In analysis of variance, the addition of control variables is difficult and clumsy. With multiple regression, however, the inclusion of such variables is easy and natural: each of them is merely another vector of scores, another X!
Analysis of Covariance
It has been found in large-scale studies by Prothro and Grigg and by McClosky that people's agreement with social issues is greater the more abstract the issue.¹⁰ Suppose a political scientist believes that authoritarianism has a good deal to do with this relation, that the more authoritarian the person, the more he agrees with abstract social assertions.
In order to study the relation between abstractness and agreement, he will have to control
authoritarianism. In other words, the political scientist is interested in studying the rela-
tion between abstractness of issues and statements, on the one hand, and agreement with
such issues and statements, on the other hand. He is not at this point interested in authori-
tarianism and agreement; he needs, rather, to control the influence of authoritarianism on
agreement. Authoritarianism is the covariate.
The political scientist devises three experimental treatments, A₁, A₂, and A₃, different
levels of abstractness of materials. He obtains responses from 15 subjects who have been
assigned randomly to the three experimental groups, five in each group. Before the exper-
iment begins the investigator administers the F (authoritarianism) scale to the 15 subjects
and uses these measures as a covariate. He wishes to control the possible influence of
authoritarianism on agreement. This is a fairly straightforward analysis of covariance
problem in which we test the significance of the differences among the three agreement
means after correcting the means for the influence of authoritarianism and taking into
account the correlation between authoritarianism and agreement. We now do the analysis
of covariance using multiple regression analysis.
First, the data are presented in the usual analysis of covariance way in Table 34.6. In
analysis of covariance one does separate analyses of variance on the X scores, the Y
scores, and the cross products of the X
and Y scores, XY. Then, using regression analysis,
one calculates sums of squares and mean squares of the errors of estimate of the total and
the within groups and, finally, the adjusted between groups. Since the concern here is not
with the usual analysis of covariance procedure, we do not do these calculations. Instead,
we proceed immediately to a multiple regression approach to the analysis.

[Table 34.6, the analysis of covariance layout of the X (authoritarianism) and Y (agreement) scores under treatments A₁, A₂, and A₃, is not reproduced here.]
which, at 2 and 11 degrees of freedom, is significant at the .05 level. (Note that an ordinary one-way analysis of variance of the three groups, without taking the covariate into account, yields a nonsignificant F ratio.) R²y(2,3)·1, the variance of Y accounted for by the regression on variables 2 and 3 (the experimental treatments) after allowing for the correlation of variable 1 and Y, is .1110. While this is not a strong relation, especially compared with the massive correlation between the covariate, authoritarianism, and Y (r² = .75), it is not inconsequential. Evidently abstractness of issues influences agreement responses: the more abstract the issues, the greater the agreement.¹¹
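A hedged sketch of the regression approach to this analysis of covariance, with simulated data (not the book's scores): enter the covariate first, then test the increment in R² contributed by the treatment vectors.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5                                          # five subjects per treatment group
    auth = rng.normal(50, 10, size=3 * n)          # covariate: F-scale scores
    x2 = np.array([1]*n + [0]*n + [0]*n)           # dummy vector for A1
    x3 = np.array([0]*n + [1]*n + [0]*n)           # dummy vector for A2; A3 = (0, 0)
    treat = np.array([2.0]*n + [1.0]*n + [0.0]*n)  # built-in treatment effect
    y = 0.5 * auth + treat + rng.normal(0, 2, size=3 * n)   # agreement scores

    def r2(Xcols):
        X = np.column_stack([np.ones(3 * n)] + Xcols)
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        return 1 - np.sum((y - X @ b)**2) / np.sum((y - y.mean())**2)

    r2_cov = r2([auth])                # covariate entered alone
    r2_all = r2([auth, x2, x3])        # covariate plus treatment vectors
    increment = r2_all - r2_cov        # variance added by the treatments
    F = (increment / 2) / ((1 - r2_all) / (15 - 3 - 1))   # df = 2 and 11, as in the text
    print(round(increment, 3), round(F, 2))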
The analysis of covariance, then, is seen to be simply a variation on the theme of multiple regression analysis. And in this case it happens to be easier to conceptualize than the rather elaborate analysis of covariance procedure, especially if there is more than one covariate. What is an independent variable in one study may be a covariate in another study. Cohen neatly says, "one man's main effect is another man's covariate."¹²
The beauty, power, and general applicability of multiple regression emerge rather clearly in this example. And it should be borne in mind that two, even three, covariates can be easily handled with multiple regression. With two covariates and two other independent variables, for instance, we simply write the F ratio: [equation not reproduced]

Carry the reasoning a step further. The use of analysis of covariance with factorial designs is complex. It is simpler with multiple regression analysis. Take a 2 × 2 factorial design. . . .

As one would expect from the name, multivariate analysis of variance is the multivariate counterpart of analysis of variance; the influence of k independent experimental variables on m
dependent variables is assessed. Path analysis is more a graphic and heuristic aid than a
multivariate method. As such, it has great usefulness, especially for helping to clarify and
conceptualize multivariate problems.
Discriminant Analysis
A discriminant function is a regression equation with a dependent variable that represents group membership. The function maximally discriminates the members of the groups; it tells us to which group each member probably belongs. In short, if we have two or more independent variables and the members of, say, two groups, the discriminant function gives the "best" prediction, in the least-squares sense, of the "correct" group membership of each member of the sample. The discriminant function, then, can be used to assign individuals to groups on the basis of their scores on two or more measures. From the scores on the two or more measures, the least-squares "best" composite score is calculated. If this is so, then, the higher the R², the better the prediction of group membership. In other words, when dealing with two groups, the discriminant function is nothing more than a multiple regression equation with the dependent variable a nominal variable (coded 0, 1) representing group membership. (With three or more groups, however, discriminant analysis goes beyond multiple regression methods.¹³)
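A minimal two-group sketch with simulated data: regress the (0, 1) membership vector on the measures and use the least-squares composite to assign members.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 30                                    # members per group
    group = np.repeat([1, 0], n)              # 1 = one group, 0 = the other
    # Two fictitious measures whose means differ between the groups
    X = rng.normal(size=(2 * n, 2)) + group[:, None] * 1.5

    Xd = np.column_stack([np.ones(2 * n), X])
    b = np.linalg.lstsq(Xd, group.astype(float), rcond=None)[0]
    composite = Xd @ b                        # the least-squares "best" composite
    assigned = (composite > 0.5).astype(int)  # assign each member to a group
    print("hit rate:", (assigned == group).mean())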
Discriminant analysis can be used to study the relations among variables in different populations or samples. Suppose we have ratings of administrators on administrative performance and we also have found, through the In-Basket Test, that three kinds of performance are important.¹⁴ We wish to know how successful and unsuccessful administrators, as judged by an independent criterion, perform on the three tests, which are Ability to Work With Others (X₁), Motivation for Administrative Work (X₂), and General Professional Skill (X₃). Suppose the discriminant regression equation were: Y' = .06X₁ + .45X₂ + .30X₃. From this equation, we can form the tentative conclusion that Ability to Work With Others seems unimportant compared to Motivation for Administrative Work and General Professional Skill. In other words, the discriminant equation gives us a profile picture of the difference between successful and unsuccessful administrators as measured by the In-Basket Test.
Canonical Correlation¹⁵
It is not too large a conceptual step from multiple regression analysis with one dependent
variable to multiple regression analysis with more than one dependent variable. Computa-
tionally, however, it is a considerable step. We will not, therefore, supply the actual
calculations. The regression analysis of data with k independent variables and m depend-
ent variables is called canonical correlation analysis. The basic idea is that, through
"The reader will find excellent guidance in: M. Tatsuoka, Discriminanl Analysis: The Study of Group
Differences. Champaign, III.: Institute for Personality and Ability Testing. 1970.
Hemphill, D. Griffiths, and N. Frederiksen, Adminisiraiive Performance and Personality.
'"J. New York:
Teachers College Press. 1962.
reasons that we will not now discuss, canonical correlation analysis can be considered obsolescent. We
"For
will,however, examine it briefly because one or two of its aspects need to be understood, the most important of
which is that a traditional multivariate view terms it a generalization of multiple regression: the relations among
k .< variables, on the one hand, and m y variables, on the other hand.
least-squares analysis, two linear composites are formed, one for the independent variables, Xⱼ, and one for the dependent variables, Yⱼ. The correlation between these two composites is the canonical correlation. And, like R, it will be the maximum correlation possible given the particular sets of data. It should be clear that what has been called until now multiple regression analysis is a special case of canonical analysis. In view of practical limitations on canonical analysis, it might be better to say that canonical analysis is a generalization of multiple regression analysis.
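A sketch of the classical computation, on the standard assumption that the canonical correlations are the square roots of the eigenvalues of Rxx⁻¹RxyRyy⁻¹Ryx; the data here are simulated, not from either study discussed below:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    X = rng.normal(size=(n, 3))                                     # k = 3 measures
    Y = X @ rng.normal(size=(3, 2)) + 2 * rng.normal(size=(n, 2))   # m = 2 criteria

    R = np.corrcoef(np.column_stack([X, Y]), rowvar=False)
    k = X.shape[1]
    Rxx, Ryy, Rxy = R[:k, :k], R[k:, k:], R[:k, k:]

    # One nonzero root per possible source of covariation (min(k, m) of them)
    M = np.linalg.solve(Rxx, Rxy) @ np.linalg.solve(Ryy, Rxy.T)
    roots = np.sort(np.linalg.eigvals(M).real)[::-1]
    print(np.sqrt(np.clip(roots, 0, None)))    # canonical R's, largest first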
Tetenbaum studied the relations between, on the one hand, personality measures of the needs for order, achievement, and affiliation and, on the other hand, control, intellectuality, dependency, and ascendancy as shown in ratings of teachers by graduate students.¹⁶ The general hypothesis tested was that the personality needs would be related to the teacher ratings. They were. The three needs were related to three sources of covariation in the teacher ratings. As predicted, students' ratings of teachers were influenced by their needs. (See below for comments on multiple sources of covariation.)
The second use of canonical analysis comes from a study by Walberg of the relations between five sets of independent variables consisting of measures of the social environment of learning, student biographical items, and miscellaneous variables (dogmatism, authoritarianism, intelligence, and so on), on the one hand, and a set of dependent variables consisting of cognitive and noncognitive measures of learning.¹⁷ Separate analyses were run between each set of independent variables and the set of dependent variables. Of the five sets of independent variables, three predicted significantly to the learning criteria. One interesting result was the canonical correlation between the learning environment variables (Intimacy, Friction, Formality, and so on) and the dependent learning variables (Science Understanding, Science Interest, and so on). The canonical R was .61, indicating a fairly substantial relation between the linear composites of the two sets of variables.
In order to understand the significance of what is theoretically perhaps Walberg's most
important finding, the reader should know that, as we found out in earlier discussions,
there can be more than one source of variation in a set of data. Similarly, there can be
more than one source of covariation in the two sets of variables being analyzed by canoni-
cal correlation. If there is more than one source, then more than one canonical correlation
can be found.
Walberg found that 15 of the independent variables each correlated significantly with
the set of dependent variables collectively. In a separate canonical analysis of these two
sets of variables, two statistically significant canonical correlations were found: .64 and
.60.¹⁸ The first canonical variate or component was produced by the independent varia-
bles positively correlated with the cognitive learning gains of Physics Achievement, Sci-
ence Understanding, and Science Processes. The second variate was produced by the
gains on noncognitive dependent variables: Science Interest, Physics Interest, and Physics
Activities. In short, Walberg was able, through canonical analysis, to present a highly
condensed generalization, as he calls it, about the relations between cognitive learning,
noncognitive learning, and a variety of environmental and other variables related to
learning.
¹⁶T. Tetenbaum, "The Role of Student Needs and Teacher Orientations in Student Ratings of Teachers," American Educational Research Journal, 12 (1975), 417-433.
¹⁷H. Walberg, "Predicting Class Learning: An Approach to the Class as a Social System," American Educational Research Journal, 6 (1969), 529-542.
¹⁸The nature of canonical analysis is such that when the second linear component is calculated, it is orthogonal to the first component. Thus the above canonical correlations reflect two independent sources of variance in the data.
As one might suspect, analysis of variance has its multivariate counterpart, multivariate analysis of variance, which enables researchers to assess the effects of k independent variables on m dependent variables. Like its univariate companion, which we examined in some detail earlier, it is or should be used for experimental data. We forego further discussion here except to say that, as in all or most multivariate analysis, the results of multivariate analysis of variance are sometimes difficult to interpret, because the difficulties mentioned earlier of assessing the relative importance of variables in their influence on one dependent variable, as in multiple regression analysis, are often compounded in multivariate analysis of variance, canonical correlation, and discriminant analysis.¹⁹
Path Analysis
Path analysis is a form of applied multiple regression analysis that uses path diagrams to guide problem conceptualization or to test complex hypotheses. Through its use one can calculate the direct and indirect influences of independent variables on a dependent variable. These influences are reflected in so-called path coefficients, which are actually standardized regression coefficients (betas, β). Moreover, one can test different path models for their congruence with observed data.²⁰ While path analysis has been and is an important analytic and heuristic method, it is doubtful that it will continue to be used to help test models for their congruence with obtained data. Rather, its value will be as a heuristic method to aid conceptualization and the formation of complex hypotheses. The testing of such hypotheses, however, will probably be done with analytic tools more powerful and more appropriate for such testing. Let us look at an example to give a general idea of the approach.
Consider the two models, a and b, of Figure 34.1.²¹ Suppose we are trying to "explain" achievement, x₄ in the figure, or GPA, grade-point average. I believe that model a is "correct"; you believe, however, that model b is "correct." Model a says, in effect, that SES and intelligence both influence x₃, n achievement, or need for achievement, and that x₃ influences x₄, GPA or achievement. Well and good! I believe, in other words, that model a best expresses the relations among the four variables. On the other hand, you believe that model b is a better representation. It adds a direct influence of x₂, intelligence, on x₄, achievement (note the paths from x₂ to x₄ and from x₂ to x₃ to x₄). Which model is "correct"? It is possible in path analysis to test the two models. From the calculated path coefficients one produces two correlation matrices, R_a and R_b. Each of these is compared to the original four-by-four matrix, R, and the one whose correlations are closer to those of R is a "better" model. (It is possible, of course, that both models produce much the same results or that neither does.)
The calculations of the path coefficients are easy. Take model a. We calculate, first, the regression of x₃ on x₁ and x₂. This will yield the two path coefficients p₃₁ and p₃₂, which are the same as the beta weights, β₁ and β₂. Second, calculate the regression of x₄ on x₃, which yields p₄₃, or β₄₃. (In addition, error terms are also calculated. We ignore them here.) From these path coefficients one calculates the R matrix they imply and
"See Pedhazur, op. cil. chaps. 17-18. For another very good discussion of multivariate analysis of vari-
ance, see Bray and S. Maxwell. "Analyzing and Interpreting Significant MANOVAs," Review of Educa-
J.
Figure 34.1 x₁: SES (Socioeconomic Status); x₂: Intelligence; x₃: n-Ach, or Need for Achievement; x₄: GPA, or Grade-Point Average (Achievement) [path diagrams of models a and b not reproduced]
related statistics.²² These are the ideas behind path analysis. We will return to such ideas when we study analysis of covariance structures in Chapter 36, a much more satisfying and scientifically rigorous approach to testing alternative models.
²²Pedhazur gives all details of the calculations. Students are urged to study his discussion and examples.
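A sketch of the calculations for model a, with data simulated so as to follow the model; the coefficients .3, .4, and .5 are arbitrary illustrative values:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 500
    x1 = rng.normal(size=n)                           # SES
    x2 = rng.normal(size=n)                           # intelligence
    x3 = 0.3 * x1 + 0.4 * x2 + rng.normal(size=n)     # n-Ach (model a paths)
    x4 = 0.5 * x3 + rng.normal(size=n)                # GPA, influenced only by x3

    def z(v):                                         # standardize, so b's are betas
        return (v - v.mean()) / v.std()

    z1, z2, z3, z4 = map(z, (x1, x2, x3, x4))
    p31, p32 = np.linalg.lstsq(np.column_stack([z1, z2]), z3, rcond=None)[0]
    p43 = np.linalg.lstsq(z3[:, None], z4, rcond=None)[0][0]

    # One entry of the implied R matrix: r14 = p43 * (p31 + p32 * r12)
    r12 = np.corrcoef(z1, z2)[0, 1]
    print(round(p43 * (p31 + p32 * r12), 3),          # implied r14
          round(np.corrcoef(z1, z4)[0, 1], 3))        # observed r14: close, as expected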
. . . effective in teaching and learning research as drawing paradigms of the designs using analysis of variance analytic partitioning.

The answer is that both methods should be taught and learned. The additional demands on both teacher and student are inevitable, just as the development, growth, and use of inferential statistics earlier in the century made their teaching and learning inevitable. Multiple regression and other multivariate methods, however, will no doubt suffer some of the lack of understanding, even opposition, that inferential statistics has suffered. Even today there are psychologists, sociologists, and educators who know little about inferential statistics or modern analysis, and who even oppose their learning and use. This is part of the social psychology and pathology of the subject, however. While there will no doubt be cultural lag, the ultimate acceptance of these powerful tools of analysis is probably assured.
Multivariate methods, as we have seen, are not easy to use and to interpret. This is due not only to their complexity; it is due even more to the complexity of the phenomena that behavioral scientists work with. One of the drawbacks of educational research, for instance, has been that the enormous complexity of a school or a classroom could not adequately be handled by the too-simple methods used. Scientists, naturally, can never mirror the "real" world with their methods of observation and analysis. They are forever bound to simplifications of the situations and problems they study. They can never "see things whole," just as no human being can see and understand the whole of anything. But multivariate methods mirror psychological, sociological, and educational reality better than simpler methods, and they enable researchers to handle larger portions of their research problems. In educational research, the days of the simple-methods experiment with an experimental group and a control group are almost over. In sociological research, the reduction of much valuable data to frequency and percentage crossbreaks will decrease relative to the whole body of sociological research.
Most important of all, the healthy future of behavioral research depends on the healthy
development of psychological, sociological, and other theories to help explain the rela-
tions among behavioral phenomena. By definition, theories are interrelated sets of con-
structs or variables. Obviously, multivariate methods are well adapted to testing fairly
complex theoretical formulations, since their very nature is the analysis of several varia-
bles at once. Indeed, the development of behavioral theory must go hand-in-hand with, even depend upon, the assimilation, mastery, and intelligent use of multivariate methods.
Study Suggestions
Kerlinger, F., and Pedhazur, E. Multiple Regression in Behavioral Research. New York:
Holt, Rinehart and Winston, 1973. A text that attempts to enhance understanding of multi-
ple regression and its research uses by providing as simple an exposition as possible and
many examples with simple numbers. Also has a complete multiple regression computer
program (Appendix D).
Cooley, W., and Lohnes, P. Multivariate Data Analysis. New York: Wiley, 1971. Although more difficult than its predecessor, its computer routines (in Fortran) can be adapted to different installations. It is also an important textbook.
Lewis-Beck, M. Applied Regression: An Introduction. Beverly Hills: Sage Publications, 1980. One of the Sage manuals of quantitative applications. A good workable treatment that covers most essential points.
Snedecor, G., and Cochran, W. Statistical Methods. 6th ed. Ames, Iowa: Iowa State University Press, 1967. Chaps. 6 and 13 are pertinent, and very good, indeed.
Tatsuoka, M. Multivariate Analysis: Techniques for Educational and Psychological Research. New York: Wiley, 1971. This clearly written middle-level book has little discussion of multiple regression, but competently attacks many important multivariate problems.
Tatsuoka, M. Discriminant Analysis: The Study of Group Differences. Champaign, Ill.: Institute for Personality and Ability Testing, 1970. An excellent manual. Highly recommended.
After the student and researcher have mastered the elements of multiple regression analysis and
have had some experience with actual problems, the following references provide sophisticated
guidance in the use of multiple regression analysis and, more important, the interpretation of data.
Cohen, J. "Multiple Regression as a General Data-Analytic System." Psychological Bulletin, 70 (1968), 426-433. Successfully shows the relation between multiple regression and analysis of variance and also suggests general research uses of multiple regression.
Darlington, R. "Multiple Regression in Psychological Research and Practice." Psychologi-
cal Bulletin, 69 (1968), 161-182. Excellent, highly sophisticated, and sobering discussion
of multiple regression.
Rulon, P., and Brooks, W. "On Statistical Tests of Group Differences." In D. Whitla, ed., Handbook of Measurement and Assessment in Behavioral Sciences. Reading, Mass.: Addison-Wesley, 1968, chap. 2. Lean exposition of the relations among a wide range of tests of statistical significance. Highly recommended (for the advanced student).
The following books are fundamental: they emphasize the theoretical and mathematical bases of
multivariate methods.
Green, P. Mathematical Tools for Applied Multivariate Analysis. New York: Academic Press, 1976.
.00. How much dependence can be put on these weights? What would happen if we reversed the
order of entry of the independent variables?
4. Here are three sets of simple fictitious data, laid out for an analysis of variance. Lay out the data for multiple regression analysis, and calculate as much of the regression analysis as possible. Use dummy coding (1, 0), as in Table 34.2. The b coefficients are: b₁ = 3; b₂ = 6.

A₁: 7, 6, 5, 9, 8
A₂: 12, 9, 11, 10, 8
A₃: 5, 2, 4, 6, 3

Imagine that A₁, A₂, and A₃ are three methods of changing racial attitudes and that the dependent variable is a measure of change, with higher scores indicating more change. Interpret the results.

[Answers: a = 4; R² = .75; F = 18, with df = 2, 12; ss_reg = 90; ss_t = 120. Note that these fictitious data are really the scores of Table 34.2 with 1 added to each score. Compare the various regression and analysis of variance statistics, above, with those calculated with the data of Table 34.2.]
5. Using the data of Table 33.2 in Chapter 33, calculate the sums of each X₁ and X₂ pair. Correlate these sums with the Y scores. Compare the square of this correlation with R²y.12 = .51 (r² = .70² = .49). Since the two values are quite close, why shouldn't we simply use the averages of the independent variables and not bother with the complexity of multiple regression analysis?
6. Here are several interesting studies that have effectively used multiple regression, path analysis, and discriminant analysis. Read one or two of them carefully. Those marked with an asterisk are perhaps easier than the others.

Bachman, J., and O'Malley, P. "Self-Esteem in Young Men: A Longitudinal Analysis of the
Chapter 35

Factor Analysis

Because of its power, elegance, and closeness to the core of scientific purpose, factor analysis can be called the queen of analytic methods. In this chapter we explore what factor analysis is and why and how it is done. In the exploration we will also examine old and new researches in which factor analysis has been a central methodology.
Factor analysis serves the cause of scientific parsimony. It reduces the multiplicity of
tests and measures to greater simplicity. It tells us, in effect, what tests or measures belong together: which ones virtually measure the same thing, in other words, and how much they do so. It thus reduces the number of variables with which the scientist must
much they do so. It thus reduces the number of variables with which the scientist must
cope. It also helps the scientist locate and identify unities or fundamental properties
underlying tests and measures.
A factor is a construct, a hypothetical entity, a latent variable that is assumed to
underlie tests, scales, items, and, indeed, measures of almost any kind. A number of
factors have been found to underlie intelligence, for example: verbal ability, numerical
ability, abstract reasoning, spatial reasoning, memory, and others. Similarly, aptitude,
attitude, and personality factors have been isolated and identified. Even nations and peo-
ple have been factored!
FOUNDATIONS

A Hypothetical Example

[The hypothetical six-test correlation matrix (Table 35.1), its two clusters, and the surrounding discussion are not reproduced here.]
. . . V. N, AS, and AT all involve numerical or arithmetic operations. Suppose we named this factor Arithmetic. A friend points out to us that test N does not really involve arithmetic operations, since it consists mostly of manipulating numbers nonarithmetically. We overlooked this in our eagerness to name the underlying unity. Anyway, we now name the factor Numerical, or Number, or N. There is no inconsistency; all three tests involve numbers and numerical manipulation and operation.
Both questions have been answered: there are two factors, and they are named Verbal, V, and Numerical, N. It must be hastily and urgently pointed out, however, that neither question is ever finally answered in actual factor analytic research. This is especially true in early investigations of a field. The number of factors can change in subsequent investigations using the same tests. One of the V tests may also have some variance in common with another factor, say K. If a test measuring K is added to the matrix, a third factor may emerge. Perhaps more important, the name of a factor may be incorrect. Subsequent investigation using these V tests and other tests may show that V is not common to all the tests. The investigator must then find another construct, another source of common factor variance. In short, factor names are tentative; they are hypotheses to be tested in further factor analytic and other kinds of research.²
If a test measures one factor only, it is said to be factorially "pure." To the extent that a
test measures a factor, it is said to be loaded on the factor, or saturated with the factor.
Factor analysis is not really complete unless we know whether a test is factorially "pure"
and how saturated it is with a factor. If a measure is not factorially pure, we usually want
to know what other factors pervade it. Some measures are so complex that it is difficult to
tell just what they measure. A good example is teacher grades, or grade-point averages. If
²Nunnally's excellent chapter on factor analysis is well worth study: J. Nunnally, Psychometric Theory, 2d ed. New York: McGraw-Hill, 1978, chap. 10.
³Actually, factor analytic methods do not yield final solutions like that in Table 35.2. They yield solutions that require what is called "rotation of axes." Rotation will be discussed later.
⁴Some factor analysts label final solution factors I, II, . . . or I', II', . . . . In this chapter we label . . .
line, .70 is the factor loading of test N on factor B. Test AS has the following loadings: .10 on factor A and .79 on factor B.

Factor loadings are not hard to interpret. They range from −1.00 through 0 to +1.00, like correlation coefficients. They are interpreted similarly. In short, they express the correlations between the tests and the factors. For example, test V has the following correlations with factors A and B, respectively: .83 and .01. Evidently test V is highly loaded on A, but not at all on B.⁵ Tests V, R, and S are loaded on A but not on B. Tests N, AS, and AT are loaded on B but not on A. All the tests are "pure."

The entries in the last column are called communalities, or h²'s. They are the sums of squares of the factor loadings of a test or variable. For example, the communality of test R is (.79)² + (.10)² = .63. The communality of a test or variable is its common factor
matrices rarely present such a clear-cut picture. Indeed, the factor matrix of Table 35.2 was "known." The author first wrote the matrix given in Table 35.3. If this matrix is multiplied by itself, the R matrix of Table 35.1 (with diagonal values) will be obtained. In this case, all that is necessary to obtain R is to multiply each row by every other row. For example, multiply row V by row R: (.90)(.80) + (.00)(.10) = .72; row V by row S: (.90)(.70) + (.00)(.10) = .63; row S by row AS: (.70)(.10) + (.10)(.80) = .15; and so on. The resulting R matrix was then factor-analyzed.⁶
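The row-by-row multiplication amounts to the matrix product of the loading matrix with its own transpose. A Python sketch, using only the four rows of Table 35.3 quoted above (the N and AT loadings are not reproduced in this excerpt):

    import numpy as np

    loadings = np.array([[.90, .00],    # V
                         [.80, .10],    # R
                         [.70, .10],    # S
                         [.10, .80]])   # AS
    R = loadings @ loadings.T           # every row multiplied by every other row
    print(np.round(R, 2))
    # Off-diagonal entries match the text: V-R = .72, V-S = .63, S-AS = .15;
    # the diagonal holds the communalities (sums of squared loadings).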
It is instructive to compare Tables 35.2 and 35.3. Note the discrepancies. They are small. That is, the fallible factor analytic method cannot perfectly reproduce the "true" factor matrix; it estimates it. In this case the fit is close because of the deliberate simplicity of the problem. Real data are not so obliging. Moreover, we never know the "true" factor matrix. If we did, there would be no need for factor analysis. We always estimate . . .

[Tables 35.2 and 35.3, the obtained and "true" factor matrices of the six tests, are not reproduced here.]
Figure 35.1 [circle diagram of the variance components of test V; not reproduced]

theory.⁷ h² is the proportion of the total variance that is common factor variance. r_tt is the proportion of the total variance that is reliable variance. V_e/V_t is the proportion of the total variance that is error variance. In Chapter 27 an equation like this enabled us to tie reliability and validity together. Now, it shows us the relation between factor theory and measurement theory. We see, in brief, that the main problem of factor analysis is to determine the variance components of the total common factor variance.
Take test V in Table 35.2. A glance at Equation 35.6 shows us, among other things, that the reliability of a measure is always greater than, or equal to, its communality. Test V's reliability, then, is at least .69. Suppose r_tt = .80. Since V_t/V_t = 1.00, we can fill in all the terms:

h² = .69;  V_sp = r_tt − h² = .80 − .69 = .11;  V_e = 1.00 − r_tt = 1.00 − .80 = .20

Test V, then, has a high proportion of common factor variance and a low proportion of specific variance.
The proportions can be seen clearly in a circle diagram. Let the area of the circle equal the total variance, or 1.00 (100 percent of the area), as in Figure 35.1. The three variances have been indicated by blocking out areas of the circle. V_co, or h², for example, is 69 percent, V_sp is 11 percent, and V_e is 20 percent of the total variance.

A factor analytic investigation including test V would tell us mainly about V_co, the common factor variance. It would tell us the proportion of the test's total variance that is common factor variance and would give us clues to its nature by telling us which other tests share the same common factor variance and which do not.⁸
The student of factor analysis must learn to think spatially and geometrically if he is to
grasp the essential nature of the factor approach. There are two or three good ways to do
⁷See J. Guilford, Psychometric Methods, 2d ed. New York: McGraw-Hill, 1954, pp. 354-357, and Thurstone, op. cit., chap. 11.
⁸See Figure 27.1 in Chapter 27 for a diagrammatic two-test illustration of these notions.
but not on the other. They are all relatively "pure" measures of their respective factors. A
seventh point has been indicated in Figure 35.2 by a circled cross in order to illustrate a
presumed test that measures both factors. Its coordinates are (.60, .50). This means that
the test is loaded on both factors, .60 on A and .50 on B. It is not "pure." Factor
structures of this simplicity and clarity, where the factors are orthogonal (the axes at right
angles to each other), the test loadings substantial and "pure," almost no tests loaded on
two or more factors, and only two factors, are not common.
Most published factor analytic studies report more than two factors. Four, five, even
nine, ten, and more factors have been reported. Graphical representation of such factor
structures in one graph is, of course, not possible. Factor analysts customarily plot factors
two at a time, though it is possible to plot three at a time. It must be admitted, however,
that it is difficult to visualize or keep in mind complex n-dimensional structures. One
therefore visualizes two-dimensional structures and generalizes to n dimensions algebrai-
cally.''
Figure 35.2 [the six tests and the circled-cross test plotted on orthogonal factor axes A and B; axis scale marks not reproduced]
"•One of the fortunate aspects of computer factor analysis programs is tliat such factor plotting is easily
possible. In the widely available BMDP and SAS factor analysis programs, for example, one can instruct the
computer to print the plots one wishes to see. W. Dixon, ed., BMDP Statistical Software 1981. Berkeley:
University of California Press, 198 Program P4M, Factor Analysis, p. 497; SAS User's Guide 1979 Edition.
1 ,
Gary, N.C.: SAS Institute, 1979, Program Factor, p. 204. It is highly likely, too, that the expanded memories of
microcomputers will make factor plotting possible in the near future.
most R matrices, however, the clusters cannot be so easily identified. More objective and
precise methods are needed.
The principal factors method extracts a maximum amount of variance as each factor is calculated. In other words, the first factor extracts the most variance, the second the next most variance, and so on.
To show the logic of the principal factors method without considerable mathematics is
difficult. One can achieve a certain intuitive understanding of the method, however, by
approaching it geometrically. Conceive tests or variables as points in m-dimensional
space. Variables that are highly and positively correlated should be near each other and
away from variables with which they do not correlate. If this reasoning is correct, there
should be swarms of points in space. Each of these points can be located in the space if
suitable axes are inserted into the space, one axis for each dimension of the m dimensions.
Then any point's location is its multiple identification obtained by reading its coordinates
on the m axes. The factor problem is to project axes through neighboring swarms of points
and to so locate these axes that they "account for" as much of the variances of the
variables as possible.
Imagine the room you are sitting in to have swarms of points in various parts of its
three-dimensional space. Imagine that some of the points cluster together in the upper
right center of the room (from your vantage point). Now imagine another cluster of points
at another point in the room, say in the lower right center. Part of the problem is to locate
axes — three axes in this case, since the room is three-dimensional — so as to identify and
appropriately label the swarms and the points in the swarms.
We can demonstrate these ideas with a simple two-dimensional example. Suppose we
have five tests. These tests, let us say, are situated in two-dimensional space as indicated
in Figure 35.3. The closer two points are, the more they are related. The problem is to
determine: (1) how many factors there are; (2) what tests are loaded on what factors; and . . .

¹¹See ibid., chap. 8. This is a thorough exposition, with computational and analytic details.
points 4 and 5 are high on factor A, and point 3 has low loadings on both factors. The three questions originally asked have been answered.

This procedure is analogous to psychological factor problems. Tests are conceived as points in m-dimensional factor space. The factor loadings are the coordinates. The problem is to introduce appropriate reference frames or axes and then to "read off" the factor loadings. Unfortunately, in actual problems we do not know the number of factors (the dimensionality of the factor space and thus the number of axes) or the location of the points in space. These must be determined from the data.
The above description is figurative. One does not "read off" factor loadings from reference axes; one calculates them using rather complex methods. The principal factors method actually involves the solution of simultaneous linear equations. The roots obtained from the solution are called eigenvalues. Eigenvectors are also obtained; after suitable transformation, they become the factor loadings. The fictitious R matrix of Table 35.1 was solved in this manner, yielding the factor matrix to be given later in Table 35.5. Most computer analysis programs use principal factors solutions. And the programs given in published computer program texts use it. The student who expects to use factor analysis . . .
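A sketch of the eigenvalue extraction, applied to a small correlation matrix built from the four loadings quoted earlier, with unities in the diagonal for illustration (strictly a principal components variant; principal factors proper would place communality estimates in the diagonal):

    import numpy as np

    L_true = np.array([[.90, .00], [.80, .10], [.70, .10], [.10, .80]])
    R = L_true @ L_true.T
    np.fill_diagonal(R, 1.0)

    roots, vectors = np.linalg.eigh(R)            # eigenvalues and eigenvectors of R
    order = np.argsort(roots)[::-1]               # largest root first
    roots, vectors = roots[order], vectors[:, order]

    # Transform eigenvectors into loadings by scaling with sqrt(eigenvalue);
    # the sign of each column is arbitrary.
    print(np.round(vectors[:, :2] * np.sqrt(roots[:2]), 2))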
Table 35.4 [the unrotated factor matrix of the six tests; not reproduced]
This would amount to saying that all the tests measure the same thing (factor I), but that
the first three measure the negative aspect of whatever the second three measure (factor
II). But aside from the ambiguous nature of such an interpretation, we know that the
reference axes, I and II, and consequently the factor loadings, are arbitrary. Look at the
factor plot of Figure 35.2. There are two clearly defined clusters of tests clinging closely
to the axes A and B. There is no general factor here, nor is there a bipolar factor. The
second major problem of factor analysis, therefore, is to discover a unique and compelling
solution or position of the reference axes.
Plot the loadings of I and II, and we "see" the original unrotated structure. This has been done in Figure 35.5. Now swing the axes so that I goes as near as possible to the V, R, and S points and, at the same time, II goes as near as possible to the N, AS, and AT points. A rotation of 45 degrees will do nicely. We then obtain essentially the structure of Figure 35.2. That is, the new rotated positions of the axes and the positions of the six tests are the same as the positions of the axes and tests of Figure 35.2. The structure simply leans to the right. Turn the figure so that the B of the B axis points directly up and this becomes clear. It is now possible to read off the new rotated factor loadings on the rotated axes. (The reader can confirm this by reading off and writing down the loadings of the tests on the rotated axes of Figure 35.5.) Since the axes are kept at a 90-degree angle, this is called an orthogonal rotation.
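A numerical sketch of such a 45-degree orthogonal rotation; the unrotated loadings are illustrative values of the general-plus-bipolar pattern described above, not the actual Table 35.4 entries:

    import numpy as np

    unrotated = np.array([[.6, -.5], [.6, -.5], [.6, -.5],    # V, R, S
                          [.6,  .5], [.6,  .5], [.6,  .5]])   # N, AS, AT
    theta = np.radians(-45)                       # swing the axes 45 degrees
    T = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    print(np.round(unrotated @ T, 2))
    # Each test now loads on one factor and near zero on the other. Because
    # T is orthogonal (T'T = I), communalities and the reproduced R are unchanged.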
This example, though unrealistic, may help the reader understand that factor analysts
search for the unities that presumably underlie test performances. Spatially conceived,
they search out the relations among variables "out there" in multidimensional factor

Figure 35.5 [the unrotated axes I and II with the six test points, and the axes rotated 45 degrees; diagram not reproduced]
space. Through knowledge of the empirical relations among tests or other measures, they
probe in factor space with reference axes until they find the unities or relations among
relations, if they exist.

To guide rotations, Thurstone laid down five principles or rules of simple structure.¹²
The rules are applicable to both orthogonal and oblique rotations, though Thurstone em-
phasized the oblique case. (Oblique rotations are those in which the angles between axes
are acute or obtuse.) The simple structure principles are as follows:
1. Each row of the factor matrix should have at least one loading close to zero.
2. For each column of the factor matrix there should be at least as many variables with zero or
near-zero loadings as there are factors.
3. For every pair of factors (columns) there should be several variables with loadings in one
factor (column) but not in the other.
4. When there are four or more factors, a large proportion of the variables should have negligi-
ble (close to zero) loadings on any pair of factors.
5. For every pair of factors (columns) of the factor matrix there should be only a small number
of variables with appreciable (nonzero) loadings in both columns.
In effect, these criteria call for as "pure" variables as possible, that is, each variable
loaded on as few factors as possible, and as many zeros as possible in the rotated factor
matrix. In this way the simplest possible interpretation of the factors can be achieved. In
other words, rotation to achieve simple structure is a fairly objective way to achieve
variable simplicity or to reduce variable complexity.
To understand this, imagine an ideal solution in which simple structure is "perfect." It might look like this:

[An idealized factor matrix for factors A, B, and C, each test loaded on one factor only, is not reproduced here.]
[A factor plot on axes I and II, with the ability tests marked by circles, and an accompanying table of test types are not reproduced here.]
Factor Scores
While second-order factor analysis is more oriented toward basic and theoretical research, another technique of factor analysis, so-called factor scores or measures, is eminently practical, though not without theoretical significance. Factor scores are measures of individuals on factors. Suppose, like Lohnes and Marshall, we found two factors underlying 21 ability and grade measures. Instead of using all 21 scores of groups of children in research, why not use just two scores calculated from the factors? Lohnes and Marshall recommend just this, pointing out the redundancy in the usual scores of pupils. These factor scores are, in effect, weighted averages, weighted according to the factor loadings.

Here is an oversimplified example. Suppose the factor matrix of Table 35.2 were actual data and that we want to calculate the A and B factor scores of an individual. The raw scores of one individual on the six tests, say, are: 7, 5, 5, 3, 4, 2. We multiply these scores by the related factor loadings, first for factor A and then for factor B. [The multiplication table is not reproduced here.] The individual's "factor scores" are F_A = 13.98 and F_B = 7.99. We can, of course, calculate other individuals' "factor scores" similarly.
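A sketch of the weighted-sum computation. The full Table 35.2 loading matrix is not reproduced in this excerpt, so the loading vectors below are stand-in values in its spirit; with the book's actual loadings the sums come to F_A = 13.98 and F_B = 7.99:

    import numpy as np

    raw = np.array([7, 5, 5, 3, 4, 2], dtype=float)   # one individual's six test scores

    # Stand-in loadings (illustrative values only, not the book's)
    load_A = np.array([.83, .79, .70, .05, .10, .05])
    load_B = np.array([.01, .10, .10, .70, .79, .70])

    F_A = raw @ load_A     # each raw score weighted by its loading on A, then summed
    F_B = raw @ load_B
    print(round(F_A, 2), round(F_B, 2))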
This is not the best way to calculate factor scores.-^ The example was made up to
convey the idea of such scores as weighted sums or averages, the weights being the factor
loadings. In any case, the method, though not extensively used in the past, has great
potential for complex behavioral research. Instead of using many separate test scores,
fewer factor scores can be used. An excellent real example is described by Mayeske, who
participated in the reanalysis of the data of the Coleman report, Equality of Educational
Opportunity.
First, the scores of fifth-grade students on five achievement tests were weighted with
principal component weights (loadings) and added to obtain an overall achievement composite score (as in the above example). This was the dependent variable in many subsequent analyses. Second, the intercorrelations of sets of student variables and school variables were factor analyzed. Then "factor scores" were calculated to form indices: for example, socioeconomic status, attitude toward life, training of teacher, experience of teacher. These scores were used in multiple regression and other analyses. In short, more
than 400 variables were reduced to 31 "factor variables" calculated from the data of over
130,000 students in 923 schools, thus achieving considerable parsimony and increasing
the reliability and validity of the measures.
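A sketch of the first step, weighting standardized test scores by first-principal-component loadings to form a composite, under the assumption of synthetic data (the procedure, not the numbers, mirrors Mayeske's analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
tests = rng.normal(size=(500, 5)) + rng.normal(size=(500, 1))  # 500 pupils, 5 tests
                                                               # sharing common variance

R = np.corrcoef(tests, rowvar=False)        # 5 x 5 correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)        # eigh returns eigenvalues ascending
weights = eigvecs[:, -1]                    # first principal component weights

z = (tests - tests.mean(0)) / tests.std(0)  # standardize each test
composite = z @ weights                     # one achievement composite per pupil
```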
RESEARCH EXAMPLES
Most factor analytic studies have factored intelligence, aptitude, and personality tests and
scales, the tests or scales themselves being intercorrelated and factor analyzed. The
Thurstone example, below, is an excellent example; indeed, it is a classic. Persons, or the
responses of persons, as we saw in Chapter 32, can also be factored. The variables entered
into the correlation and factor matrices, in fact, can be tests, scales, persons, items,
concepts, or whatever can be intercorrelated. The studies given below have been selected
not to represent factor analytic investigations in general, but rather to familiarize the
student with different uses of factor analysis.
Thurstone and Thurstone, in their monumental work on intelligence factors and their
measurement, factor analyzed 60 tests plus the three variables chronological age, mental
age, and sex. The analysis was based on the test responses of 710 eighth-grade pupils to
the 60 tests. It revealed essentially the same set of so-called primary factors that had been
found in previous studies.
The Thurstones chose the three best tests for each of seven of the ten primary factors.
Six of these tests seemed to have stability at different age levels sufficient for practical
school use. They then revised and administered these tests to 437 eighth-grade school
children. The main purpose of the study was to check the factor structure of the tests. In
other words, they predicted that the same primary factors of intelligence put into the 21
tests would emerge from a new factor analysis on a new sample of children.
The rotated factor matrix (oblique rotation) is given in Table 35.6. This is a remarka-
ble validation of the primary factors. The seven factors and their loadings are almost
exactly as predicted. (See, especially, the italicized loadings.)
Table 35.6 Oblique Rotated Factor Matrix, Thurstone and Thurstone Study
One of the most active, important, and controversial problems of behavioral scientific and
practical interest is the nature of mental abilities. Different theories with different amounts
and kinds of evidence to support them have been propounded by some of the ablest
psychologists of the century: Spearman, Thurstone, Burt, Thorndike, Guilford, Cattell,
and others. There can be no doubt whatever of the high scientific and practical importance
of the problem. We have alluded, if only briefly, to the work and thinking of Thurstone
and Guilford. We now describe, also briefly, one among the many factor analytic studies of Raymond Cattell.
The famous general factor of intelligence, g, can be shown to be a second-order factor that runs through most tests of mental ability. Cattell believes, in effect, that there are two g's, or two aspects of g: crystallized and fluid. Crystallized intelligence is exhibited by cognitive performances in which "skilled judgment habits" have become fixed or crystallized owing to the earlier application of general learning ability to such performances. The
well-known verbal and number factors are examples. Fluid intelligence, on the other
hand, is exhibited by performances characterized more by adaptation to new situations,
the "fluid" application of general ability, so to speak. Such ability is more characteristic
of creative behavior than is crystallized intelligence. If tests are factor analyzed and the
correlations among the factors found are themselves factored (second-order factor analy-
sis), then both crystallized and fluid intelligence should emerge as second-order factors.
Cattell administered Thurstone's primary abilities tests and a number of his own
mental ability and personality tests to 277 eighth-grade children, factor analyzed the 44
variables, and rotated the obtained 22 factors (probably too many) to simple structure. The
correlations among these factors were themselves factored, yielding eight second-order
factors. (Recall that oblique rotations yield factors that are correlated.) Although Cattell
included a number of personality variables, we concentrate only on the first two factors,
fluid intelligence and crystallized intelligence. He reasoned that Thurstone's tests, since they measure crystallized cognitive abilities, should load on one general factor, and that his own culture-fair tests, since they measure fluid ability, should load on another factor. They did. The two sets of factor loadings are given in Table 35.7, together with the names of the tests.
of the tests. The two factors also were correlated positively (r = .47), as predicted.
This study demonstrates the power of an astute combination of theory, test construc-
tion, and factor analysis. Similar to Guilford's equally astute conceptualization and
analysis of convergent, divergent, and other factors mentioned earlier, it is a significant
contribution to psychological knowledge of an extremely complex and important subject.
It is not often that factor analysis of the responses of random samples of subjects has been
done. Indeed, Thurstone even said that representative samples should not be used in factor
analytic studies. Verba and Nie, in a large sophisticated study of political participation in the United States, report factors obtained from a random sample of over 3,000 citizens. A distinction between exploratory factor analysis and confirmatory factor analysis is increasingly made in the literature. Exploratory factor analysis is usually the use of
factor analysis to learn the factors underlying a set of variables or measures. Confirmatory
factor analysis is the use of factor analysis to test hypotheses about the factor structures of
sets of data. In confirmatory factor analysis, in other words, one sets up a model that
reflects aspects of a theory, and then somehow sees whether the model fits the observed
data. The Thurstone and Thurstone model examined earlier is a good example. The
factors were predicted before the data were gathered and the factor analysis done. The
point of this disquisition is that Verba and Nie's study of political participation is clearly
and consciously confirmatory factor analysis. They say, for instance,
it is important to be clear on what we are not doing. We are not looking to see what clusterings among political acts we find. Rather, we are looking to see whether the clustering we expect to find, given our analysis of the alternative characteristics of the modes of activity, is indeed found.
Verba and Nie's analysis of political activity led them to believe that four modes of
such activity lie behind political participation: citizen-initiated contacts, voting, campaign
activity, and cooperative activity. These are conceived to be different, if related, ways
that citizens can influence their government. The authors designed a questionnaire by
constructing several measures for each of the four modes of political activity: persuade others how to vote, contribute money to party or candidate, contact local officials, for example. There were thirteen of these measures that formed the questionnaire administered to the sample.
Thurstone, Multiple Factor Analysis, p. xii. The wisdom of Thurstone's words was shown in a cross-cultural study in which a referent attitude scale was administered to a random sample of the Netherlands and also to a separate sample of University of Amsterdam students. The student factors were clear and readily interpretable, but those of the random sample were much less clear and difficult to interpret. See F. Kerlinger, C. Middendorp, and J. Amon, "The Structure of Social Attitudes in Three Countries: Tests of a Criterial Referent Theory," International Journal of Psychology, 11 (1976), 265-279. (Also reported in Kerlinger, Liberalism and Conservatism: The Nature and Structure of Social Attitudes, chap. 7.)
-"S. Verba and N. Nie. Participation in America: Political Democracy and Social Equality. Copyright ©
1972 by Sidney Verba and Norman H. Nie. Reprinted by permission of Harper & Row, Publishers, Inc. The
obliquely rotated factor matrix of Table 35.8
is their Table 4-3, p. 65.
For an excellent discussion of the two approaches, see S. Mulaik, The Foundations of Factor Analysis. New York: McGraw-Hill, 1972, pp. 361-366.
Verba and Nie, op. cit., p. 57.
Ibid., pp. 51-54.
Table 35.8 Rotated Oblique Factor Matrix, Thirteen Political Activities, Verba and Nie Study
devise a measure of it and test its "reality" by correlating data obtained with the measure
with data from other measures theoretically related to it. Factor analysis helps us check
our theoretical expectations.
Part of the basic life-stuff of any science is its constructs. Old constructs continue to be used; new ones are constantly being invented. Note some of the general constructs
directly pertinent to behavioral and educational research: achievement, intelligence,
learning, aptitude, attitude, problem-solving ability, needs, interests, creativity, conform-
ity. Note some of the more specific variables important in behavioral research: test anxi-
ety, verbal ability, traditionalism, convergent thinking, arithmetic reasoning, political
participation, and social class. Clearly, a large portion of scientific behavioral research
effort has to be devoted to what might be called construct investigation or construct
validation. This requires factor analysis.
When we talk about relations we talk about the relations between constructs: intelli-
gence and achievement, authoritarianism and ethnocentrism, reinforcement and learn-
ing, organizational climate and administrative performance —
all these are relations be-
tween highly abstract constructs or latent variables. Such constructs usually have to be
operationally defined to be studied. Factors are latent variables, of course, and the major
scientific factor analytic effort in the past has been to identify the factors and occasionally
use the factors in measuring variables in research. Rarely have deliberate attempts been
made to assess the effects of latent variables on other variables. With recent advances and
developments in multivariate thinking and methodology, however, it is clear that it is now
possible to assess the influence of latent variables on each other. This important develop-
ment will be discussed and illustrated in the next chapter on analysis of covariance struc-
tures. We will find there that the scientist can obtain indices of the magnitudes and
statistical significance of the effects of latent variables on other latent variables. If this is
so, then factor analysis becomes even more important in identifying the latent variables or
factors, and the scientist has to exercise great care in the interpretation of data in which the
influences of latent variables are assessed.
Many research areas, then, can well be preceded by factor analytic explorations of the variables of the area. This does not mean that a number of tests are thrown together and given to any samples that happen to be available. Factor analytic investigations, both exploratory and hypothesis-testing, have to be painstakingly planned. Variables that may be influential have to be controlled: sex, education, social class, intelligence, and so on. Variables are not put into a factor analysis just to put them in; they must have a legitimate purpose. If, for instance, one cannot control intelligence by sample selection,
one can include a measure of intelligence (verbal, perhaps) in the battery of measures. By
identifying intelligence variance, one has in a sense controlled intelligence. One can learn
whether one's measures are contaminated by response biases by including response-bias
measures in factor analyses.
The second major purpose of factor analysis is to test hypotheses. One aspect of hypothesis-testing has already been hinted: one can put tests or measures into factor analytic batteries deliberately to test the identification and nature of factors. The design of such studies has been well outlined by Thurstone, Cattell, Guilford, and others. First,
factors are "discovered." Their nature is inferred from the tests that are loaded on them.
This "nature" is set up as a hypothesis. New tests are constructed and given to new
samples of subjects. The data are factor analyzed. If the factors emerge as predicted, the
hypothesis is to this extent confirmed; the factors would seem to have "reality." This will
certainly not end the matter. One will have to test, among other things, the factors'
relations to other factors. One will have to place the factors, as constructs, in a nomologi-
cal network of constructs.
A less well-known use of factor analysis as a hypothesis-testing method is in testing experimental hypotheses. One may hypothesize that a certain method of teaching reading changes the ability patterns of pupils, so that verbal intelligence is not as potent an influence as it is with other teaching methods. An experimental study can be planned to test this hypothesis. The effects of the teaching methods can be assessed by factor analyses of a set of tests given before and after the different methods were used. Woodrow tested a similar hypothesis when he gave a set of tests before and after practice in seven tests: adding, subtracting, anagrams, and so on. He found that factor loading patterns
changed with practice.

Giving a factor a name does not give it reality. Factor names are simply attempts to epitomize the essence of factors. They are always tentative, subject to later confirmation or disconfirma-
tion. Then, too, factors can be produced by many things. Anything that introduces correl-
ation between variables "creates" a factor. Differences in sex, education, social and
cultural background, and intelligence can cause factors to appear. Factors also differ, at least to some extent, with different samples. Response sets or test forms may cause
factors to appear. Despite these cautions, it must be said that factors do repeatedly emerge
with different tests, different samples, and different conditions. When this happens, we
have fair assurance that there is an underlying variable that we are successfully measuring.
There are serious criticisms of factor analysis. The major valid ones center around the
indeterminacy of how many factors to extract from a correlation matrix and the problem of
how to rotate factors. Another difficulty that bothers critics and devotees alike is what can be called the communality problem, or what quantities to put into the diagonal of the R matrix before factoring. In an introductory chapter, these problems cannot be discussed. The reader is referred to the discussions of Cattell, Guilford, Harman, and Thurstone. A criticism of a different order seems to bother educators and sociologists and some psychologists. This takes two or three forms that seem to boil down to distrust, sometimes profound, combined with antipathy toward the method due to its complexity and, strangely enough, its objectivity.

J. Guilford, "Factorial Angles to Psychology," Psychological Review, 68 (1961), 1-20. This is an important article that any investigator who uses factor analysis should study.
B. Fruchter, "Manipulative and Hypothesis-Testing Factor-Analytic Experimental Designs." In R. Cattell, ed., Handbook of Multivariate Experimental Psychology. Skokie, Ill.: Rand McNally, 1966, chap. 10.
H. Woodrow, "The Relation between Abilities and Improvement with Practice," Journal of Educational Psychology, 29 (1938), 215-230.
The argument runs something like this. Factor analysts throw a lot of tests together
into a statistical machine and get out factors that have little psychological or sociological
meaning. The factors are simply artifacts of the method. They are averages that corre-
spond to no psychological reality, especially the psychological reality of the individual, other than that in the mind of the factor analyst. Besides, you can't get any more out of
a factor analysis than you put into it.
The argument is basically irrelevant. To say that factors have no psychological mean-
ing and that they are averages is both true and untrue. If the argument were valid, no
scientific constructs would have any meaning. They are all, in a sense, averages. They are
all inventions of the scientist. This is simply the lot of science. The basic criterion of the
"reality" of any construct, any factor, is its empirical, scientific "reality.'" If. after
uncovering a factor, we can successfully predict relations from theoretical presuppositions
and hypotheses, then the factor has "reality." There is no more reality to a factor than
this, just as there is no more reality to an atom than its empirical manifestations.
The argument about only getting out what is put into a factor analysis is meaningless
as well as irrelevant. No competent factor analytic investigator would ever claim more
than this. But this does not mean that nothing is discovered in factor analysis. Quite the
contrary. The answer is, of course, that we get nothing more out of a factor analysis than
we put into it, but that we do not know all we put into it. Nor do we know what tests or
measures share common factor variance. Nor do we know the relations between factors.
Only study and analysis can tell us these things. We may write an attitude scale that we
believe measures a single attitude. A factor analysis of the attitude items, naturally,
cannot produce factors that are not in the items. But it can show us, for example, that there
are two or three sources of common variance in a scale that we thought to be unidimen-
sional. Similarly, a scale that we believe measures authoritarianism may be shown by
factor analysis to measure intelligence, dogmatism, and other variables.
If we examine empirical evidence rather than opinion, we must conclude that factor
analysis is one of the most powerful tools yet devised for the study of complex areas of
behavioral scientific concern. Indeed, factor analysis is one of the creative inventions of
the century, just as intelligence testing, conditioning, reinforcement theory, the opera-
tional definition, the notion of randomness, measurement theory, research design, multi-
variate analysis, the computer, and theories of learning, personality, development, orga-
nizations, and society are.
It is fitting that this chapter conclude with some words of a great psychological scien-
tist, teacher, and factor analyst, Louis Leon Thurstone:
As scientists, we have the faith that the abilities and personalities of people are not so complex as the total enumeration of attributes that can be listed. We believe that these traits are made up of a smaller number of primary factors or elements that combine in various ways to make a long list of traits. It is our ambition to find some of these elementary abilities and traits. . . .
"See G. Allpon. Pattern and Growth in Personality. New York; Holt. Rinehart and Vv'inston, 1961, pp.
329-330; G. Allport, Personality. New York: Holt, Rinehart and Winston, 1937, pp. 242-248.
^'K. Deutsch, J, Piatt, and D. Senghaas, 'Conditions Favoring Major Advances in Social Science." Sci-
ence. 171 (1971), 450-459.
All scientific work has this in common, that we try to comprehend nature in the most
parsimonious manner. An explanation of a set of phenomena or of a set of experimental obser-
vations gains acceptance only in so far as it gives us intellectual control or comprehension of a
relatively wide variety of phenomena in terms of a limited number of concepts. The principle of
parsimony is intuitive for anyone who has even slight aptitude for science. The fundamental
motivation of science is the craving for the simplest possible comprehension of nature, and it
finds satisfaction in the discovery of the simplifying uniformities that we call scientific laws."
ADDENDUM
Sample Size and Replication
Two desiderata, even necessities, of factor analysis are large samples and replication. A
general rule is: Use as large samples as possible. Like any statistical procedure, factor
analysis is subject to measurement and sampling error, and the reliable identification of
factors and factor loadings requires large N's to wash out error variance. This is especially
true for item factor analysis, because item intercorrelations are usually lower and less
reliable than test intercorrelations. A loose but not bad rule-of-thumb might be: ten sub-
jects for each variable (item, measure, etc.).
Replication is too seldom practiced in any research. And it is particularly needed in
factor analytic studies. The "reality" of factors is much more compelling if found in two
or three different and large samples. Factor loadings and patterns of loadings, like regres-
sion coefficients (which, by the way, they are) are often unstable, especially in smaller
samples. A good rule is: Replicate all studies. This does not mean literal duplication of
studies. Indeed, the word "replication" means doing additional studies based on the same
problems and variables but with minor, sometimes major, variations. For example, the
measurement instrument of an original study may have been found wanting. A replication
of the study done with another sample and an improved instrument and similar results
would be compelling evidence of the empirical validity of the original results.
Study Suggestions
1. Fortunately, there are good, even excellent, books and articles on factor analysis. Unfortu-
nately, there is as yet no satisfactory and up-to-date book written at an elementary level. So, to learn
factor analysis, one has to work hard at it, using more advanced texts. It is suggested that the student
who will not take a course in factor analysis use either the Harman (footnote 5) or the Thurstone
(footnote 6) text or both. Both are definitive but rather difficult.
2. The more advanced student will find the following selected articles valuable:
. . . as a general factor (a common practice, by the way). Carroll settles the issue of what a general factor is.
COAN, R. "Facts, Factors, and Artifacts: The Quest for Psychological Meaning." Psychologi-
cal Review, 71 (1964), 123-140. A good general theoretical article on factor analysis, with
discussion of the interpretation of factors.
L. Thurstone, The Measurement of Values. Chicago: University of Chicago Press, 1959, p. 8.
3. The individual who wishes a broad overview — with considerable specificity, however — has
a few excellent sources available. Among the following general references, the French monograph
and the French et al. reference test kit are valuable. Rather well-established cognitive factors are
named, described, and illustrated. The kit is very valuable.
4. As usual, there is no substitute for the study of actual research uses of methods. The student
should, therefore, read two or three good factor analytic studies. Select from those cited in the chapter.

A. Hendrickson and P. White, "PROMAX: A Quick Method for Rotation to Oblique Simple Structure," British Journal of Statistical Psychology, 17 (1964), 65-70.
Analysis of Covariance
Structures
In this long and involved dissertation on the foundations of behavioral research, we have
often talked of the importance of theory and the testing of theory. We have from time to
time stressed the purpose of scientific research as formulating explanations of natural
phenomena and submitting implications of the explanations to empirical test. In this
chapter, we study and try to understand a highly developed and sophisticated conceptual
and analytic system to model and test scientific behavioral theories: analysis of covariance
structures. To do this, we focus largely on the powerful mathematical-statistical system
and computer program LISREL (Linear Structural Relations), conceived and developed by
Joreskog and his colleagues to set up and analyze covariance structures.
Unfortunately, analysis of covariance structures and LISREL are hard to learn and to
use. The difficulty, it must be confessed, is to explain the system in language comprehen-
sible to nonmathematical readers and, at the same time, to stay within the purposes and
confines of this book. So the discussion is limited to presenting and explaining the bare
mathematical skeleton of the system and how and why it is used. Fortunately, our subject
is closely related to the discussions of multiple regression analysis and factor analysis of earlier chapters.
K. Joreskog and D. Sorbom, LISREL-V: Analysis of Linear Structural Relationships by Maximum Likelihood and Least Squares Methods. Uppsala, Sweden: University of Uppsala, 1981 (henceforth we refer to this book as the Manual); K. Joreskog and D. Sorbom, Advances in Factor Analysis and Structural Equation Models (edited by J. Magidson). Cambridge, Mass.: Abt Books, 1979. The latter reference is a collection of papers by Joreskog and Sorbom. The former reference is the manual that accompanies the fifth version of the computer program LISREL.
[Figure: Liberalism and Conservatism diagrams]
F. Kerlinger, "The Structure and Content of Social Attitude Referents: A Preliminary Study," Educational and Psychological Measurement, 32 (1972), 613-630.
Table 36.1
The duality hypothesis predicts substantial and positive correlations among variables 1, 2, and 3, and among variables 4, 5, and 6. These are the italicized r's in the table. The duality hypothesis also predicts zero or near-zero r's between the C variables (1, 2, and 3) and the L variables (4, 5, and 6). We call these cross-correlations.
Study of the R matrix seems to show that the duality hypothesis is supported because
the italicized r's in Table 36.2 are substantial and positive, and the cross-correlations are
low and near-zero. But two of the cross r's are -.237 and -.225. Although not statistically significant, they are still not zero and we are left in doubt. Besides, this example is
simple; most examples encountered in practice are not so simple. In other words, we need
a better method to test the duality hypothesis. The method actually used in analysis of
covariance structures is as follows. The data are analyzed according to the model setup, in
this case the duality model: two orthogonal factors (see Figure 36.1). From the parameters estimated by the data analysis, factor analysis in this case, an R matrix is calculated by using the estimated parameters of the theoretical model. This is done by writing equations for each of the x's.
To help us clearly understand what is done and why, we first set up the two theories in
path diagrams. Behavioral researchers who use "modeling" or "causal modeling," as it is called, use squares for observed variables and circles for unobserved or latent variables. Single-
headed arrows are used to indicate influences, double-headed arrows to indicate correla-
tions. In other words, the path diagrams used in analysis of covariance structures follow
much the same principles and practices of path analysis discussed earlier.
To set up the equation of a model, we "define" each variable that is at the end of an
arrow. For example, from Figure 36.1, A, we write for x1 and for x5:

x1 = λ11ξ1 + δ1
x5 = λ52ξ2 + δ5

The values or parameters to be estimated are: λ11 and δ1, and λ52 and δ5. ξ1 and ξ2, of course, are latent variables or factors I and II, and thus are not estimated. Their "effects," or λ11 and λ52, are estimated. Naturally, we must have six equations since there are six x's. These are written as matrices:

(x1)   (λ11   0 )          (δ1)
(x2)   (λ21   0 )          (δ2)
(x3) = (λ31   0 )  (ξ1)  + (δ3)     (36.3)
(x4)   ( 0   λ42)  (ξ2)    (δ4)
(x5)   ( 0   λ52)          (δ5)
(x6)   ( 0   λ62)          (δ6)

x = Λxξ + δ     (36.4)
Double subscripts of variables are necessary to identify them unambiguously. In the duality model, the factor loadings need two subscripts, i and j, where i is the row (variable) designation and j is the column (factor) designation. Look at the factor matrix a, Table 36.3. The λ's (lambda) have two subscripts; e.g., λ32 means lambda, the factor loading of the third row (i = 3) and the second column (j = 2).
"The necessity of knowing matrix algebra should be apparent. The reader who does not know matrix
algebra can take the development on faith.
Table 36.3 Factor Matrices of (a) Ordinary Factor Analysis and (b) LISREL Constrained Factor Analysis
rotated factors of an ordinary factor analysis are given in a, and the LISREL constrained
solution is given in b. You may well ask: What happens to the factor loadings where the O's are in b? The point is that b expresses the "pure" form of the duality hypothesis. As
said earlier, the computer is instructed to do the calculations keeping the O's of Tables
36.3 and 36.4 intact. But how about the fairly large negative loadings, -.44 and -.36 in a, the conventional factor analysis? Both are substantial, negative, and statistically signif-
icant, contrary to the duality hypothesis. They are deviations from the duality model. The
key question, then, is: Are the deviations large enough to invalidate the hypothesis, which
specifies O's? We will return to this point shortly.
The model of Figure 36.1, A, and Equation 36.3 requires calculation of the error terms, δ (delta). The six error terms were calculated, but we are not interested in the
method of calculation. Much more interesting and relevant to the duality hypothesis is the
estimation of variances and covariances of the Φ (phi) matrix because it expresses the relations between ξ1 and ξ2, the factors. Remember that the duality hypothesis included the correlation between the two factors: it will be zero or close to zero. Look back at Figure 36.1, A, and note that, in accordance with the duality hypothesis, φ21 = r12 = 0. While r12 can be constrained to be zero, we chose, instead, to let LISREL estimate φ21 for reasons to be given later. The variances of ξ1 and ξ2 were set equal to 1.00, φ11 = φ22 = 1.00, and r12 or φ21 was "free." (When a parameter is "free" in LISREL, the program estimates its value.)
r12 = -.15, not statistically significant. So, in effect, the two factors are orthogonal, which is consistent with the duality hypothesis. Recall that the theory says that conservatism and liberalism are separate and independent dimensions of social attitudes. This means, of course, that the correlation between them is zero (or close to zero).
The crucial question, however, is: Is the whole model congruent with the data? The
whole model of the duality hypothesis is expressed by Figure 36.1, A, and by equations
36.3. Following the rules of LISREL, we instruct the computer to estimate the six factor
loadings. An, A21. A31, A42, A52, A62 of the factor matrix A^, while maintaining the zero
constraints in the matrix. We also specify that the error terms of the six equations of 36.3
be calculated. We
must also specify what the relations between the two factors will be;
therefore we must instruct the computer what to do with the phi (<I>) matrix. We set
</>! = <A22 = 00. These are the diagonal entries of <P. which are the variances of ^1 and
1 1
^2- We also instruct the computer to estimate <^2i (which is ^12). Following an iterative
procedure, the computer estimates the 13 values we have specified to be estimated: λ11, λ21, λ31, λ42, λ52, λ62; δ1, δ2, δ3, δ4, δ5, δ6; φ21, using the correlations among the six variables (Table 36.2) as input data. It also constrains the zeroes of Table 36.3 and sets the phi's: φ11 = φ22 = 1.00. The factor loadings, or Λx, are given in Table 36.4, b, and r12 = φ21 = -.15. The six error terms are: .58, .25, .61, .50, .71, and .60. Are these
values congruent with the data, or alternatively, does the duality model "fit" the data?
The core idea behind the assessment of the "goodness of fit" of a theoretical model is
simple and powerful. Use the estimated parameter values and the constrained values to
calculate a predicted correlation matrix, R*, and then compare this predicted matrix to the
obtained or observed correlation matrix, R. This can be done by subtracting R* from R,
or R - R*. This matrix of differences is called a residual matrix. The residual matrix under the duality hypothesis is given above the diagonal of Table 36.5. Study of these residuals . . .
There are a number of other important methodological points we do not discuss. One of these is the assumptions behind the analysis. To do maximum likelihood analysis, for example, it must be assumed that the distribution of the x variables is multivariate normal. Another assumption or requirement is identification; the LISREL problem must be set up so that all estimated parameters can be identified.
The matrix R* can in this case be generated by multiplying the rows of Table 36.4: r12 = (.65)(.87) + (0)(0) = .57; r13 = (.65)(.63) + (0)(0) = .41; r23 = (.87)(.63) + (0)(0) = .55; and so on.
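The footnote's arithmetic is one matrix product. A sketch under stated assumptions: the factor-I loadings (.65, .87, .63) are those used in the footnote, while the factor-II loadings are illustrative placeholders, since Table 36.4 is not reproduced here.

```python
import numpy as np

# Constrained loading pattern of the duality model (as in Table 36.4, b).
L = np.array([[.65, 0], [.87, 0], [.63, 0],   # conservative measures, factor I
              [0, .70], [0, .55], [0, .62]])  # liberal measures (illustrative)
Phi = np.array([[1.00, -.15],
                [-.15, 1.00]])                # phi matrix; phi21 = r12 = -.15

R_star = L @ Phi @ L.T                        # predicted (reproduced) correlations
np.fill_diagonal(R_star, 1.0)

def rms_residual(R_obs, R_pred):
    """Root mean square of the off-diagonal residuals, R - R*."""
    resid = (R_obs - R_pred)[np.triu_indices_from(R_obs, k=1)]
    return np.sqrt((resid ** 2).mean())

print(np.round(R_star, 2))  # compare entry by entry with the observed R
```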
Table 36.5 Residuals Calculated from the Duality Hypothesis Model (Above Diagonal) and from
the Bipolarity Hypothesis Model (Below Diagonal)
The factor loadings are interesting and informative. Those of the three conservative
measures, x1, x2, and x3, are positive and substantial, but those of the liberal measures are
all low. Evidently the one-factor model is inadequate: the three liberal measures are
"lost." The x^ is also significant, indicating a lack of fit. Now look at the residuals in the
lower half of Table 36.5. Note carefully that the residuals forr45, r46, and rs^are substan-
tial: .416, .396, and .388. The correlations among the liberal measures, .V4, A5, and Xf„
were "missed" by the one-factor solution, the model for the bipolarity hypothesis. It
seems that the bipolaritymodel has not succeeded too well. The duality model, on the
other hand, made out well on all counts.
We now make a final test: we directly compare the two models. This is done through
the χ²'s. The χ² for the bipolarity model was 28.00, at 9 degrees of freedom, while the χ² for the duality model was 10.15, at 8 degrees of freedom. Recall that earlier we had the computer estimate r12, or φ21, even though, strictly speaking, we should have "fixed" it at zero, or φ21 = 0, since the pure duality model predicts orthogonal factors. One of the main reasons for doing this was to "use up" one degree of freedom so that the χ²'s of the two models could be compared. The direct test is χ²bip - χ²dual, or 28.00 - 10.15 = 17.85. The degrees of freedom are also subtracted: 9 - 8 = 1. Had we not estimated φ21, the degrees of freedom for both models would have been the same, making a χ² comparison impossible. χ² = 17.85, at df = 1, is evaluated. It is highly significant, which indicates the superiority of the duality hypothesis (since the bipolarity model χ² is significantly larger than the duality model χ²). If there is no significant difference between the χ²'s of the two models, then the bipolarity hypothesis is as "good" (or as "poor") as the duality hypothesis.
It is difficult to show and explain the way the problem has been done. A more elegant approach is as follows. Set up the duality model as it has been done above. Then set up the bipolarity model exactly the same except for the Φ matrix. For the duality model estimate φ21, as above. This will yield a χ² with df = 8. Now set up the bipolarity model fixing φ21 = 1.00, with df = 9. This will yield exactly the same parameter estimates as if the program had been told that there was only one factor, except that the one-factor loadings will appear on two factors. Since the correlation between the two factors is 1.00, the net effect is the same as with one factor. The test of the alternative hypotheses, χ²bip - χ²dual, will be the same as that given above, but it is now clear that the two models differ only in the one parameter, φ21. This is one of the reasons for estimating φ21, or r12, in the duality model: for a test of alternative hypotheses, there must be a difference in degrees of freedom. Moreover, one model must be a subset of the other model. This means that both models estimate the same parameters except (in this case) for one parameter.
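The direct test is easy to reproduce with the values reported above. A minimal sketch (scipy's survival function gives the upper-tail χ² probability):

```python
from scipy.stats import chi2

chi_bip, df_bip = 28.00, 9     # bipolarity (one-factor) model
chi_dual, df_dual = 10.15, 8   # duality (two-factor) model

diff = chi_bip - chi_dual      # 17.85
df_diff = df_bip - df_dual     # 1
p = chi2.sf(diff, df_diff)     # upper-tail probability of the difference
print(f"chi-square difference = {diff:.2f}, df = {df_diff}, p = {p:.5f}")
```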
concrete to refer to. The example is a small model of ability and achievement. We say, in effect: Verbal Ability and Numerical Ability influence Achievement positively. Although
perhaps not terribly interesting, the example has the virtue of being obvious and easily
understood. No attempt is made to test alternative hypotheses here even though there are a
number of possibilities. We seek only to convey the essence of the system.
The three parts of the system mentioned above are expressed in the following equations:

Measurement Equations:
x:  x = Λxξ + δ     (36.4)
y:  y = Λyη + ε     (36.5)

Structural Equation:
Bη = Γξ + ζ     (36.6)
We are familiar with 36.4, the x measurement equation. It succinctly expresses factor analysis, as we have seen. The y measurement equation, 36.5, is identical in form to the x measurement equation. It therefore also expresses factor analysis. Indeed, the factor analysis of the attitude example could have been done with the y measurement equation. Both equations express the relations between measured or observed variables x and y and latent or unobserved variables: x, ξ, and Λx go together, and y, η, and Λy go together. That is, the latent variables ξ and η underlie the observed variables x and y, and the entries of Λx and Λy, the factor loadings, λ, say "how much" ξ and η underlie the x and y measures.
The so-called structural equation, 36.6, is the most interesting and perhaps most important part of the system. It relates the latent variables, ξ of the x-system and η of the y-system, to each other. Look at this system and its function from a regression point of view. First, regard the x-side of Figure 36.2. We have four tests: Verbal Test 1, Verbal Test 2, Numerical Test 1, and Numerical Test 2, or x1, x2, x3, and x4. We calculate the four-by-four correlation matrix and factor analyze it, and obtain the two factors ξ1 and ξ2, as in Figure 36.2. The lambda-x, Λx, matrix ordinarily contains the factor loadings: λ11, λ12, λ21, λ22, λ31, λ32, λ41, λ42. In our case it will contain only four of these, the four in Figure 36.2: λ11, λ21, λ32, and λ42. The other lambdas will be set at zero, as we did earlier with the attitude duality hypothesis:
Tests    I      II
1       λ11
2       λ21
3              λ32
4              λ42
'"The Greek symbols used in LISREL are: A,, lambda .v. A,: lambda y; ^: xi. 17: eta. S: delta, e: epsilon, B:
beta, /3: beta, lower case; F: gamma, y: gamma, lower case, f: zela. Each of these symbols stands for a matrix,
or, in two cases, for latent variables. We use the Greek symbols because students who undertake to learn
LISREL will have to learn the Greek symbols anyway. Note that F is capital gamma and y lower-case gamma.
In addition, the matrix 4> (phi) is used but not in the above equations.
In the LISREL-V version of the system, Equation 36.6 is written a little differently: η = Bη + Γξ + ζ. This was done for a minor technical reason and is not as conceptually clear as Equation 36.6, which we use because it clearly shows the relation of the latent variables, η and ξ, to each other.
are not congruent with the data can easily be stated, invalidating the whole model. For example, the y variables, which we said measured reflections of one factor or latent variable, might be incorrect. Perhaps two factors are necessary. That is, Figure 36.2 has one factor, η, for Achievement. But there may really be two factors, η1 and η2. After all, y1 is a reading test and y2 is a mathematics test, and we know that these are usually two separate factors.
We finally arrive at the crucial relation: that between ξ1 and ξ2, the latent independent variables, and η, the latent dependent variable. Our substantive hypothesis may state that Verbal Ability, ξ1, and Numerical Ability, ξ2, both influence Achievement, η. This is not too fascinating a hypothesis, but one amenable to example and explanation.
In order to test it, we must set up the problem and model of Figure 36.2 in matrices. This
is a crucial and difficult step in LISREL. At the risk of provoking boredom, let us pursue
the ideas and set them up in equations and matrix equations, after spelling out the individ-
ual variable equations. First, the x equations. The equation for x1 is repeated here: x1 = λ11ξ1 + δ1. Similarly, x2, x3, and x4 are set up:

x1 = λ11ξ1 + δ1
x2 = λ21ξ1 + δ2     (36.7)
x3 = λ32ξ2 + δ3
x4 = λ42ξ2 + δ4

In matrix form:

(x1)   (λ11   0 )          (δ1)
(x2) = (λ21   0 )  (ξ1)  + (δ2)     (36.7a)
(x3)   ( 0   λ32)  (ξ2)    (δ3)
(x4)   ( 0   λ42)          (δ4)
(Students should pause here, study Figure 36.2 and Equations 36.7 and 36.7a, and try to
understand their meaning.)
The y-side is a bit easier:

y1 = λ1η + ε1
y2 = λ2η + ε2     (36.8)
y3 = λ3η + ε3

In matrix form:

(y1)   (λ1)         (ε1)
(y2) = (λ2)  (η)  + (ε2)     (36.8a)
(y3)   (λ3)         (ε3)
The structural equation matrices must also be set up. The structural equation was
given in 36.6; it is repeated here for convenience:
Bη = Γξ + ζ     (36.9)
η is the latent variable of the y-side of the problem, and ξ is the latent variable (or variables) of the x-side. B (beta) and Γ (gamma) are coefficient or weight matrices; ζ (zeta) is a matrix of so-called disturbance terms, or errors in the structural equation. Let's write out the individual equation:

η = γ1ξ1 + γ2ξ2 + ζ     (36.10)

Notice that B, the beta matrix, has dropped out since there is only one η. B spells out the relations among the η, or latent y, variables if there is more than one η. For example, if we had two η's, η1 and η2, it may well be that η1 influences η2, and we would want to assess this influence.
As usual, we must write Equation 36.10 in matrix form so that it can be prepared for LISREL analysis:

η = (γ1  γ2) (ξ1  ξ2)′ + ζ     (36.10a)

The parameters γ1 and γ2 (gamma 1 and gamma 2) are the most important part of the problem because they estimate the effects of the latent variables, Verbal Ability (ξ1) and Numerical Ability (ξ2), on Achievement (η). The three sets of equations, 36.7a, 36.8a, and 36.10a, then, spell out the LISREL problem.
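To see how the three sets of equations mesh, one can generate the correlation matrix the model implies. A sketch under stated assumptions: the loading, Φ, and ψ values below are invented for illustration (only γ1 = .42 and γ2 = .35 echo values reported later), and all variables are taken as standardized.

```python
import numpy as np

Lx = np.array([[.80, 0], [.70, 0], [0, .75], [0, .65]])  # Lambda-x, pattern of 36.7a
Ly = np.array([[.80], [.70], [.60]])                     # Lambda-y, pattern of 36.8a
Phi = np.array([[1.00, .30], [.30, 1.00]])               # cov(xi1, xi2)
Gamma = np.array([[.42, .35]])                           # gamma1, gamma2
psi = .38                                                # var(zeta), the disturbance

# Structural equation 36.10: eta = gamma1*xi1 + gamma2*xi2 + zeta
var_eta = Gamma @ Phi @ Gamma.T + psi                    # variance of eta (1 x 1)
cov_xi_eta = Phi @ Gamma.T                               # cov(xi, eta) (2 x 1)

# Covariances implied for the observed variables; the delta and epsilon error
# variances absorb whatever is needed to bring the diagonals to 1.00.
Sxx = Lx @ Phi @ Lx.T
Syy = Ly @ var_eta @ Ly.T
np.fill_diagonal(Sxx, 1.0)
np.fill_diagonal(Syy, 1.0)
Sxy = Lx @ cov_xi_eta @ Ly.T                             # cov(x, y) (4 x 3)

Sigma = np.block([[Sxx, Sxy], [Sxy.T, Syy]])             # full implied matrix
print(np.round(Sigma, 2))
```

Fitting reverses this arithmetic: the program searches for parameter values whose implied matrix comes as close as possible to the observed one.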
A fictitious correlation matrix was synthesized so that the LISREL solution would
support the model of the path diagram of Figure 36.2 and the equations written on the
basis of the diagram. The results were very nice, indeed. χ² = 7.85, which, at 11 degrees of freedom, is not significant (p = .73). The model fits: it is congruent with the data. rms = .03, a low value indicating that the residuals were small. Other indices
calculated by LISREL all supported the conclusion that the model was satisfactory. (This
is no great achievement, of course: I set up the example so that the solution would be
satisfactory!)
Although the parameters of the x and y factor analyses (or regression analyses) are also satisfactory, they are not reported because our interest is in testing the model for congruence with the data, in this case a correlation matrix, and in assessing the relations between the latent variables ξ1 and ξ2, or Verbal Ability and Numerical Ability, on the one hand, and η, Achievement, on the other hand. The Γ coefficients, γ1 and γ2, express these influences. The values were γ1 = .42 and γ2 = .35, both statistically significant (p < .01). Verbal Ability and Numerical Ability have moderate positive and statistically significant "influences" on Achievement. In the language of Chapter 33, the regression of η, Achievement, on ξ1, Verbal Ability, and ξ2, Numerical Ability, is moderate and statistically significant. LISREL also calculates what is in effect R² for the structural equation: R² = .437. This is a moderate to substantial relation. It can be interpreted as the proportion of variance of the y variables, as expressed by the dependent latent variable, η, accounted for by the latent independent variables, ξ1 and ξ2. This index, then, is a multivariate expression of the multiple regression of η on ξ1 and ξ2.
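This R² can be verified with the determinant formula given in the footnote to this passage, 1 - |Ψ|/cov(η). A minimal sketch using the two values recovered from the output:

```python
import numpy as np

Psi = np.array([[.382]])      # disturbance covariance matrix (a single zeta here)
cov_eta = np.array([[.679]])  # covariance of eta, from the computer output

R2 = 1 - np.linalg.det(Psi) / np.linalg.det(cov_eta)
print(round(R2, 3))           # 0.437
```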
It was said earlier that analysis of covariance structures and LISREL, the computer
program to do the necessary complex computations, are difficult, hard to learn, and hard
to use. Why bother with it, then? Can't the factor analyses and the regression analyses be
'"We do not try to explain how this is done since it is almost purely technical and would probably require
another chapter.
"This is by no means easy to do because one can easily write correlations some of which are inconsistent
with each other. In such cases LISREL will either stop or yield nonsense. In this case I was lucky: the solution
was justwhat wanted.
I
'"The calculation is accomplished by using the idea developed in Chapter 26 of calculating reliability using
the error term: r„ = I - VJV, (see Equation 26.4, Chapter 26). It is also based on the theory of matrices and
determinants. In LISREL. note that error in the structural equation is provided by the matrix (psi), whose ^
individual terms are f's. zetas. In the present example there is only one (. Thus the calculation of /?-,, f^f, =
1 - |^|/cov(7)) = 1 - (.382/. 679) = .437 is easy. |^| means '"the determinant of the matrix psi." The values
.382 and .679 are recovered from the computer output. See Joreskog and Stjrbom, LISREL Manual, p. 1.38, for
definition of the formula and P. Green, Melhodological Tools for Applied Multivariate Analysis. New York:
Academic Press, 1976, pp. 122-124, for a discussion of a generalized variance measure, which is the basis of
the above equation.
done separately with far less wear and tear on the behavioral scientist? Yes and no. The separate factor analyses of the x and y variables can certainly be done separately; indeed, psychometric and factor analytic study should be done before LISREL is used. But the general regression analysis just described obviously cannot be done separately because it depends on the x and y analyses. One may of course try various approaches to the analysis of the data. But there appears to be no simple way to study sets of complex relations and to test the congruence of theoretical models with observed data. The ideas of analysis of covariance structures are mathematically and statistically powerful, conceptually penetrating, and aesthetically satisfying. The conceptions of LISREL and other computer programs are highly ingenious, productive, and creative achievements. They are, at the
present writing, the highest development of behavioral scientific and analytic thinking, a
development that brings psychological and sociological theory and multivariate mathe-
matical and statistical analysis together into a unique and powerful synthesis that will
probably revolutionize behavioral research. It is in this sense that analysis of covariance
structures is said to be the culmination of contemporary methodology.
RESEARCH STUDIES
In the relatively short time that analysis of covariance structures and LISREL have been
functional and available — since the early and middle 1970s — the approach has been
fruitfully used in a number of fields. Some of these studies are LISREL reanalyses of
existing data; others are studies that were conceived with analysis of covariance structures
in mind. (See Study Suggestion 2 at the end of the chapter.) The first study of attitude
structure discussed in this chapter was only one of twelve sets of attitude data that were
reanalyzed using LISREL. Most of the evidence supported the duality hypothesis.
Joreskog and his colleagues have reanalyzed the data of a number of psychological and sociological studies. The first study described in detail below is a LISREL reanalysis of
the data of a large study of political participation in America, a study discussed in consid-
erable detail in Chapter 35.
Bentler and Woodward used LISREL to reanalyze Head Start data, with depressing results. They found that the Head Start program had no significant effects on the Head Start children's cognitive abilities. Judd and Milburn studied the attitude structure of the
general public of the United States. Using panel data from surveys done in 1972, 1974, and 1976, they investigated Converse's contention that the general public does not have meaningful and stable social attitudes. They found that the noneducated public does have consistent ideological predispositions.
The present discussion of LISREL seems to imply that the work of Joreskog and his colleagues is a unique contribution. Not so. Many others have contributed to the field, and Joreskog has explicitly and repeatedly cited the sources of his work and acknowledges his indebtedness. And there will certainly be further developments and other computer programs. Joreskog appears to be the first analyst, however, to bring psychometric theory and practice, factor analysis, regression theory, and computer analysis into a productive and functional synthesis.
""F. Kerlinger, "Analysis of Covariance Structure Tests of a Criterial Referents Theory of Attitudes,"
Multivariate Behavioral Research. 15 (1980), 403-422.
"One of the best articles is also one of the earliest. The computer program was called ACOVS: K. Joreskog,
"Analyzing Psychological Data by Structural Analysis of Covariance Matrices." In K. Joreskog and D. Sor-
bom. Advances, op. cii.. chap. 3. Later examples of Joreskog reanalyses have appeared in the five or six
LISREL manuals.
P. Bentler and J. Woodward, "A Head Start Reevaluation: Positive Effects Are Not Yet Demonstrable," Evaluation Quarterly, 2 (1978), 493-510.
C. Judd and M. Milburn, "The Structure of Attitude Systems in the General Public: Comparisons of a Structural Equation Model," American Sociological Review, 45 (1980), 627-643.
φ21, . . . , φ41 are the correlations among the factors. Remember that when the factors are correlated, the solution is oblique. The program instructed the computer to calculate the correlations among the parameters of Figure 36.3 and then to use the parameters to calculate a predicted correlation matrix, R*. Finally, to assess the adequacy of the fit of the model of the four oblique factors of Figure 36.3, R - R*, the differences or residuals, and various "fit" statistics were calculated.
Table 36.6 Correlations Among Thirteen Political Variables, Verba and Nie Study, Above Diago-
nal; Residuals, Below Diagonal
Figure 36.4 ξ1: SES (Socioeconomic Status); ξ2: % White in School; η1: Past Academic Achievement; η2: Popularity with Whites; η3: Present Achievement
the notion that achievement influences peer acceptance prompted Maruyama and Miller to
reanalyze Lewis and St. John's data. In other words, is the influence from peer acceptance to achievement, or is it from achievement to peer acceptance? The relation and the theory behind it are theoretically and practically important. One of the major justifications for
behind it are theoretically and practically important. One of the major justifications for
desegregation is that white peer acceptance of black children will affect the achievement
of the black children.
The LISREL model used is given in Figure 36.4. Careful attention is necessary: this is the most complex model we have yet encountered. It is also interesting and instructive, and we will use it here to reinforce earlier discussion and to make two or three new methodological points. What is important are the relations and implied influences among the latent variables, but especially among the y latent variables, η1: Past Achievement, η2: Popularity with Whites, and η3: Present Achievement. The solid arrows in Figure 36.4 indicate significant influences, which are epitomized by γ (gamma) or β (beta) coefficients. Broken arrows indicate statistically nonsignificant coefficients. I have kept Maruyama and Miller's diagram and have only changed it slightly in conformity with earlier usage in this chapter. I have also inserted the coefficients' values to facilitate discussion and interpretation.
χ² = 4.88, which, at 5 degrees of freedom, is not significant. The model fits. (The authors do not report rms or other indices of fit. Had the size of the sample been larger (it was N = 154), the χ² would probably have been significant.) The γ coefficients are not
important for our purpose, so we concentrate on the relations among the latent η (eta) variables. η1: Past Academic Achievement expresses the observed variables y1: Intelligence, and y2: GPA 5, or Grade-Point Average, Grade 5; η2: Popularity with Whites; and
η3: Present Achievement, which expresses the observed variables y4: GPA 6, or Grade-Point Average, Grade 6, and y5: Reading. We must now discuss an aspect of LISREL that we skimped earlier: the nature and use of the B (beta) matrix and the directional (regression) coefficients β.
Earlier it was said that our main interest is usually focused on the influences of the ξ (xi) latent variables on the η (eta) latent variables, as expressed by the Γ (gamma) coefficient matrix: γ11, γ21, etc. That is, the γ's express the influences of the ξ's, or independent latent variables, on the η's, or dependent latent variables. Some problems, however, do not quite fit into this usual mold, and the present problem is a case in point.
When a latent variable is an "intermediary" variable, a variable that is influenced by
another independent latent variable and that, in turn, influences a dependent variable, then
it must be treated as a dependent latent variable, η. Look at the middle horizontal layer of Figure 36.4: η1 influences η2, and η1 and η2 influence η3. The magnitudes of the influences are expressed by the γ and the β (beta) coefficients; the β's express the influence of one η on one or more other η's. Earlier in this chapter we mentioned the difference between the x and y parts of the LISREL system. The Φ (phi) matrix contains φ coefficients that do not express the influence of one ξ on one or more other ξ's. They are only covariances or, in our case, correlations. In short, the β's are regression coefficients and the φ's are correlations. Therefore, if a latent variable is an "intermediary" variable, it has to be designated as an η latent variable.
If the cultural transmission hypothesis is correct, then black children's acceptance by
white children should influence the black children's achievement positively and substan-
tially. But the β in Figure 36.4 is only .02. To make these crucial relations quite clear, we reproduce the η latent variables in Figure 36.5. Notice that the influence of Past Achievement, η1, on Present Achievement, η3, is strong, β2 = .98, as we would expect. But the path from Past Achievement, η1, to Popularity with Whites, η2, is also significant, β1 = .38. Maruyama and Miller conclude, then, that the cultural transmission hypothesis is not correct, and that achievement influences popularity or acceptance by whites.
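In matrix terms, an intermediary latent variable simply occupies a row and a column of B. A sketch under the assumption that the coefficients read off Figures 36.4 and 36.5 sit as follows (η1 → η2: .38; η1 → η3: .98; η2 → η3: .02):

```python
import numpy as np

# B[i, j] holds the effect of eta_(j+1) on eta_(i+1); values from the text.
B = np.array([[0.00, 0.00, 0.00],
              [0.38, 0.00, 0.00],   # Past Achievement -> Popularity with Whites
              [0.98, 0.02, 0.00]])  # Past Achievement and Popularity -> Present Achievement

# From eta = B eta + Gamma xi + zeta, the reduced form is
# eta = (I - B)^(-1) (Gamma xi + zeta); the same inverse gives total effects.
total = np.linalg.inv(np.eye(3) - B)
print(round(total[2, 0], 4))  # 0.9876: direct (.98) plus indirect (.38 * .02)
```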
Readers should study Figures 36.4 and 36.5 carefully and understand clearly what was
done. As said earlier, the example is important theoretically and practically. It casts doubt
on the influential cultural transmission hypothesis as an explanation of minority children's
achievement in desegregated classrooms. It suggests that white pupils' acceptance of
black children springs from black pupils' achievement. One cannot of course accept these
findings as conclusive.

Figure 36.5 η1: Past Achievement; η2: Popularity with Whites; η3: Present Achievement

We have used the study analysis to help explain analysis of covariance structures and LISREL analysis. We are not saying that the results are definitive or
conclusive. The sample was only 154 black pupils in desegregated classrooms. The re-
sults are suggestive, however, and the research should be replicated with larger samples,
in different parts of the country, and with other variables.
CONCLUSIONS—AND RESERVATIONS
It would be wrong to create the impression in the reader's mind that all problems attacked
with analysis of covariance structures and LISREL work out as well as those described in
this chapter, or that LISREL should be used with all multivariate research problems. Quite
the contrary. The purpose of this final section of the chapter is to try to put the subject into
reasonable perspective.
Let's ask the most difficult question first: When should the procedure be used? As
usual with such questions, it is hard to say clearly and unambiguously when it should be
used. One is fairly safe is that it should not be used routinely or for ordinary
precept that
statistical analysis and calculations. For instance, it should not be used to factor analyze a
set of data to "discover" the factors behind the variables of the set. It is simply not suited
to exploratory factor analysis or to testing mean differences between groups or subgroups
of data. If it is possible to use a simpler procedure, like multiple regression, discriminant analysis, or analysis of variance, and obtain answers to research questions, then using LISREL is pointless. That inappropriate use will be attempted is obvious. LISREL has
recently been added to the SPSS package, which means, among other things, that it will
be used often. Unlike many other procedures, LISREL is hard to use because its appropri-
ate use requires rather difficult conceptualization, technical understanding of measure-
ment theory, multiple regression, and factor analysis. The same was true, if to a lesser
extent, of the use of factor analysis. Yet factor analysis has been "successfully" inte-
grated into the body of behavioral research methodology, but too often poorly used. The
nature of computer packages almost makes this inevitable. One of their purposes is to
make easy what is essentially not easy. So I suspect that we will be seeing the publication
of many studies that have used LISREL — inadequately. In short, LISREL should be used only at a relatively late stage of a research program, when "crucial" tests of complex hypotheses are needed.
Analysis of covariance structures and LISREL are most suited to the study and
analysis of complex structural theoretical models in which complex chains of reasoning
are used to tie theory to empirical research. Under certain conditions and limitations, the
system is a powerful means of testing alternative explanations of behavioral phenomena.
It is not well-suited to testing the statistical significance of ordinary statistics. And it is not always possible for other methods to accomplish what LISREL can. Maruyama and Miller made this point when they discussed why they used LISREL to reanalyze the desegregation data of Lewis and St. John. LISREL often has the capability of neatly settling research hypothesis issues when other methods
cannot do so. Yet it is not a generally applicable methodology.
There are often technical difficulties in using LISREL. We have already discussed large and significant χ²'s with large numbers of subjects, and have suggested remedies, especially study of residuals and the use of R²'s and coefficients of determination, which the latest versions of LISREL (LISREL V and VI) calculate and report, and the use of coefficients that do not depend on sample size. Another remedy is the testing of alternative hypotheses when the problem permits such testing.
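The dependence of the χ² test on sample size is easy to demonstrate: in maximum likelihood covariance structure analysis the test statistic is (N - 1) times the minimized fit function. The sketch below, with hypothetical fit-function and degrees-of-freedom values (Python is used only for illustration), shows the same degree of misfit becoming "significant" as N grows.

from scipy.stats import chi2

F_min = 0.15   # hypothetical minimized ML fit-function value
df = 10        # hypothetical degrees of freedom of the model

for N in (100, 300, 1000):
    chi_square = (N - 1) * F_min   # the test statistic grows with N
    p = chi2.sf(chi_square, df)    # upper-tail probability
    print(N, round(chi_square, 2), round(p, 4))
# The misfit (F_min) is identical in all three cases, yet only the larger
# samples "reject" the model; hence the advice to study residuals and
# indices that do not depend on sample size.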
One of the most difficult problems is identification. A model being tested must be overidentified. This means that there must be more data points, usually variances and covariances, than parameters estimated. If there are p x variables and q y variables — let's say p = 3 and q = 2 — then there can be no more than t parameters estimated from the data, where t = ½(p + q)(p + q + 1). If p = 3 and q = 2, then t = ½(3 + 2)(3 + 2 + 1) = 15, and no more than 15 parameters can be estimated in a model. There are other conditions that can make a model not identifiable, but it is extremely difficult to specify them in advance.
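The counting rule is easy to mechanize. The following sketch merely computes t = ½(p + q)(p + q + 1) and checks a proposed number of free parameters against it; it is an arithmetic aid, not a test of identification, which, as noted, is far harder to establish.

def max_parameters(p, q):
    # Number of distinct variances and covariances among p x variables
    # and q y variables: t = (p + q)(p + q + 1) / 2
    k = p + q
    return k * (k + 1) // 2

t = max_parameters(p=3, q=2)
print(t)                 # 15: at most 15 parameters can be estimated
n_estimated = 12         # hypothetical count of free parameters in a model
print(n_estimated < t)   # True: the necessary counting condition is met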
The commonest technical difficulty is closely related to identification. For any one or
combination of reasons a problem may not run and may announce that "something" is
wrong. But what? On the other hand, the computer run may be completed, but some of the
parameters may not make sense. For example, negative variances may be reported. Why?
Anyone who has used LISREL to any extent is familiar with the lugubrious messages the
computer announces. When an expert is consulted, the answer is invariably: "There is
something wrong with the model." Yes, of course! But what? And, naturally, theoretical
models often do not fit: "There is something wrong with the model!" And, too, there is
the frequent occurrence of the computer analysis that works beautifully, but the statistics
indicate that the researcher's model doesn't fit. Is the theory wrong? If one is strongly
committed to a theoretical position, it may be hard to admit this. In any case, one has to
check several possibilities. One, the model doesn't fit because it was poorly or incorrectly
conceptualized. Two, it doesn't fit because the LISREL user made a mistake (or two, or
three) in using the system. Three, the computer analysis won't work because there are
flaws in the data (strong multicollinearity in a correlation matrix, for example). And four,
the model doesn't fit because the theory from which it was derived is wrong or is inappli-
cable.
Inadequate measurement is a limitation of much behavioral research. The technical difficulty of measuring psychological and sociological variables is still not appreciated by
researchers in psychology, sociology, and education. It is not easy to devise tests and
scales to measure psychological and sociological constructs; it is also not easy to do the
psychometric research necessary to establish the reliability and validity of the measures
used. It is even more difficult, evidently, to admit that one's measures are deficient. Too
often in behavioral research measures in common use are accepted and used without
question. And rarely are assumptions about study variables questioned. If we are measur-
ing, say, authoritarianism we assume that part of the latent variable authoritarianism is
anti-authoritarianism (whatever that is). Early in this chapter and in Chapter 35 we studied
research that sprang from questioning the commonly held assumption that conservatism and liberalism are logical and empirical opposites. Unfortunately, a number of studies have been done — and marred — by measurement of social attitudes based on this assumption. Similarly, other studies have been marred, perhaps ruined, both by incorrect assumptions and inadequate measurement.

Bentler's discussion of these problems and their remedies is helpful: Bentler, op. cit., pp. 427-429.
Ibid., p. 426; Joreskog and Sorbom, LISREL Manual, pp. L20-L24.
Fortunately, the program announces when a model being run cannot be identified. Yet to know why is often a mystery.
An analytic methodology, no matter how well-conceived and powerful, cannot make
up for measures whose reliability and validity are unsatisfactory. Validity by assumption
is a particularly severe threat to scientific conclusions because measurement procedures
are not questioned or tested: their reliability and validity are assumed. It is a poor factor
analysis that emerges from factoring what is in effect sloppy choice or construction of
tests and scales. Similarly, it is poor use of analysis of covariance structures when some or
all of the measures used have little sound technical basis in psychometric theory and
empirical research. The point being made should be strongly emphasized: Elegant proce-
dures applied to poor data gathered without regard to theory and logical analysis cannot
produce anything of scientific value.
Another difficulty for users of LISREL is that modern multivariate structural analysis is quite different from most earlier statistical analysis.

See F. Kerlinger, Liberalism and Conservatism: The Nature and Structure of Social Attitudes. Hillsdale, N.J.: Erlbaum, 1984.
Study Suggestions
1. There are as yet no readily accessible discussions of analysis of covariance structures in the literature. Perhaps there never will be. The subject presupposes knowledge of matrix algebra, factor analysis, and multiple regression analysis. For some of Joreskog's papers — almost all of which are difficult — see the book Advances cited in footnote 1. To use the LISREL program, the most recent Joreskog and Sorbom manual, also cited in footnote 1, is recommended. It is much easier to understand and use than earlier manuals. Perhaps the best and clearest exposition of the system is Bentler's Annual Review article (see footnote X). Another helpful exposition is: W. Saris, "Linear Structural Relationships," Quality and Quantity, 14 (1980), 205-224. Saris and Stronkhorst have written a useful introductory text: W. Saris & L. Stronkhorst, Causal Modelling in Nonexperimental Research. Amsterdam: Sociometric Research Foundation, 1984. Joreskog's presidential address to the Psychometric Society is also excellent — but not easy: K. Joreskog, "Structural Analysis of Covariance and Correlation Matrices," Psychometrika, 43 (1978), 443-477. Researchers who have used analysis of covariance structures and LISREL will profit from Cliff's skeptical and cautionary article: N. Cliff, "Some Cautions Concerning the Application of Causal Modeling Methods," Multivariate Behavioral Research, 18 (1983), 115-126. After saying that analysis of covariance structures is perhaps the most important and influential statistical revolution of the social sciences, Cliff also says:
Initially, these methods seemed to be a great boon to social science research, but there is some
danger that they may instead become a disaster, a disaster because they seem to encourage one
to suspend his normal critical faculties. Somehow the use of one of these computer procedures
lends an air of unchallengeable sanctity to conclusions that would otherwise be subjected to the
most intense scrutiny.
I agree with Cliff and will return to the danger he mentions in Appendix B on the computer and its uses.

Geiselman, R., Woodward, J., and Beatty, J. "Individual Differences in Verbal Memory Performance: A Test of Alternative Information Processing Models." Journal of Experimental Psychology: General, 111 (1982), 109-134. A sophisticated cognitive psychological study of short-term and long-term memory. Found that memory recall was a dualistic process.
Judd, C., and Milburn, M. "The Structure of Attitude Systems in the General Public: Comparisons of a Structural Equation Model." American Sociological Review, 45 (1980), 627-643. Complex study that showed that Converse's statements about the general public not having attitude structure are incorrect. Based on national survey data collected in 1972, 1974, and 1976 by the Survey Research Center of the University of Michigan.
Hill, P., and McGaw, B. "Testing the Simplex Assumption Underlying Bloom's Taxonomy." American Educational Research Journal, 18 (1981), 93-101. The authors competently tested a hierarchical model of Bloom's well-known classification of cognitive behaviors.
Kerlinger, F. "Analysis of Covariance Structures Tests of a Criterial Referents Theory of Attitudes." Multivariate Behavioral Research, 15 (1980), 403-422. The larger report of which the chapter example was one part.
Wolfle, L., and Robertshaw, D. "Effects of College Attendance on Locus of Control." Journal of Personality and Social Psychology, 43 (1982), 802-810. Interesting and well-done study of data from a national longitudinal study of the high school class of 1972.
APPENDICES

Appendix A

Historical and Methodological Research
The limits of this book forbid adequate discussion of two important and very different
kinds of research: historical research and what will be called methodological research.
This appendix will acquaint the reader with the nature of historical and methodological
research and point out the part they play in social scientific and educational research.
HISTORICAL RESEARCH
Historical research is the critical investigation of events, developments, and experiences
of the past, the careful weighing of evidence of the validity of sources of information on
the past, and the interpretation of the weighed evidence. The historical investigator, like
other investigators, then, collects data, evaluates the data for validity, and interprets the
data. Actually, the historical method, or historiography, differs from other scholarly
activity only in its subject matter, the past, and the peculiarly difficult interpretive task
imposed by the elusive nature of the subject matter.
Historical research is important in behavioral research. The roots of behavioral disci-
plines have to be understood if behavioral scientists are to be able to put their theories and
research in appropriate contexts. This is perhaps more so in disciplines like sociology,
economics, and political science than it is in psychology. Nevertheless, even the psychol-
ogist must know psychology's origins, since theories develop almost always in a context
of earlier theories and research. Reinforcement theory, for example, developed from
earlier work by Pavlov, Thorndike, and others, and psychologists of today must use this
earlier work as a cognitive stratum, so to speak, from which they do their work.
I wish to thank Professor David Madsen for expert help and guidance with the historiographical literature and in pointing out trends in historiography.
METHODOLOGICAL RESEARCH
Methodological research is controlled investigation of the theoretical and applied aspects
of measurement, mathematics and statistics, and ways of obtaining and analyzing data.
Without methodological research, modern behavioral research would still be in the research dark ages. Like historical research, it is an extremely important part of the body
scientific. This strong statement is made to counteract the somewhat negative sentiments
that many professionals in psychology, sociology, and education seem to hold about
methodological research.
Methodology is called "mere" methodology. The methodologist is called a "mere"
methodologist. This is a curious state of affairs. Some of the most competent, imaginative, and creative men in modern psychology, sociology, and education have been and are
methodologists. Indeed, it is almost impossible to do outstanding research, though one
can do acceptable research, without being something of a methodologist. It is needless to
pursue the prejudice further. My point is that methodological research is a vital and
absolutely indispensable part of behavioral research. Let us look at what the methodologi-
cal researcher does and see why these statements have been made.
Perhaps the largest and most rigorous areas of psychological and educational methodological research are measurement and statistical analysis. The methodologist — and it should be emphasized that good behavioral researchers have to be, to some extent, measurement methodologists — is preoccupied with theoretical and practical problems of identifying and measuring psychological variables. These problems have a number of aspects.
Reliability and validity, in and of themselves, are large areas of preoccupation and inves-
"M. Borrowman, "History of Education," In C. Hums. ed.. Encyclopedia of Educulioiial Research. 3ded.
New York: Macmillan, 1960. pp. 661-668. (See especially pp. 663-664.)
'Social Science Research Council, Theory and Practice in Historical Study: A Report of the Committee on
Historiography. New York: Social Science Research Council, 1946, pp. 134-135.
tigation. Then there are the theoretical and practical problems of the construction of
measuring instruments: scaling, item writing, item analysis, and so on. A methodologist
can easily spend a lifetime on any one of these aspects of measurement.
Statisticians long ago turned their talents to solving the problem of the objective
evaluation of research data. Their contributions were considered at length earlier and need
not be repeated here. It is significant to add, however, that some of the most outstanding methodological contributions of statistics have come from applied researchers — Fisher and Thurstone, to name only two.
The application of mathematics to social scientific research is well developed. Appli-
cations of set theory and probability theory were discussed early in this book. Later in the
book we saw how important matrix theory and algebra are, especially in multivariate
analysis. Analysis of covariance structures and log-linear models and analysis are two recent examples.
REFERENCES
Some readers may wish to pursue historical and methodological inquiry further. Readers
should also know that the point of view of this appendix is more or less a traditional one.
Some critics would therefore take issue with it. One of the best single sources on histori-
ography is the Social Science Research Council monograph cited in footnote 3. Here are
four books, three of which express a wide variety of views on history and historiography.
Except for textbooks (like this one), which usually lack critical focus and depth, books on the general topic of methodology — that is, books that examine methodology itself — seem not to have been published. The best single source for original contributions to the
methodology of behavioral research is probably the Psychological Bulletin. There has
hardly been a recent development in statistics, methods of observation, and general ana-
lytic approaches that has not appeared in the Bulletin.
Best, J., ed. Historical Inquiry in Education: A Research Agenda. Washington, D.C.: Ameri-
can Educational Research Association, 1983. Varied views on the history of education in
a book published by the American Educational Research Association.
Dollar, C., and Jensen, R. Historian's Guide to Statistics. New York: Holt, Rinehart and Winston, 1971. Statistics text written specifically for historians.
For an excellent book on the mathematical bases of multivariate analysis, see P. Green, Mathematical Tools for Applied Multivariate Analysis. New York: Academic Press, 1976.
This statement is true only when methodology is conceived technically, that is, methods of observation, data collection, sampling, statistical analysis, and so on. Broader aspects of methodology, such as objectivity, operationism, inference, the nature of "reality," and so on, have of course been considered by philosophers of science. American behavioral researchers tend to be preoccupied with the technical aspects of methodology. European behavioral researchers, however, tend to be more concerned about philosophical issues.
Appendix B

The Computer and Behavioral Research
The computer has become an integral and highly important part of both the conception
and the methodology of research in the behavioral disciplines. Its central role makes it
imperative that we examine, if too briefly and superficially, the major characteristics of
the modern high-speed computer. We will also explore the influence of computers, com-
puter programs, and computer uses on behavioral research and research findings and on
the work of behavioral researchers.
Another purpose of this appendix is practical: to guide the reader in computer analysis
of research data. In doing so, we will consider the use of the large university computer and
computer installation and the use of the microcomputer. Computer analysis has been done
mostly on the large computer using so-called package programs like BMDP, SAS, SPSS,
and the like. The remarkable technical development of microcomputers — often called personal computers — and their increased power and decreasing cost, however, are radically changing analysis and analytic practices. Within five years or so, students of behav-
ioral research will either own microcomputers or have them readily available on university
campuses. The future of much, perhaps most, behavioral data analysis lies with the
microcomputer. Part of the discussion of this appendix, therefore, will be on microcom-
puters and their uses.
My thanks to Prof. R. Rankin and to G. Rankin and S. Goff for expert and generous help and guidance in the complex and mysterious world of the microcomputer, and for critical reading of this appendix.
High Speed
Everyone has heard that computers are fast. Few people can really grasp how fast they
are. They are faster than almost anything we work with. In 1972 Kemeny said that a
computer is a million times faster than a human being. Here is an example. Appendix C of this book reports one sample of 4,000 random numbers between 1 and 100. I had the
computer generate four such samples of 4,000 numbers each, calculate the means, stand-
ard deviations, and correlations of the 40 "variables" in each of the samples of random
numbers like those of Appendix C. That is, Appendix C reports only the first of four
samples for each of which the random numbers were generated, and 40 means, 40 stand-
ard deviations, and 780 correlation coefficients calculated. This is 16,000 random num-
bers, 160 means, 160 standard deviations, and 3,120 correlation coefficients. In addition,
the means of the four sets of means were rank ordered from high to low. The recorded
time for all the computations was 11.1 seconds!
Curious and skeptical readers may wonder why such high speeds are desirable. After
all, what practical difference does it make if a computer takes seconds, minutes, or even
hours to do a given problem? The answer is complex and we here give only two reasons,
the two that are probably the most important for behavioral researchers. Psychologists,
sociologists, and other behavioral researchers are more and more using multivariate pro-
cedures to analyze their data and even to help conceptualize the research problems they
study. Factor analysis and multiple regression analysis have become common, and other
multivariate methods like discriminant analysis, multivariate analysis of variance, analy-
sis of covariance structures, and log-linear models are used increasingly. Such methods
are virtually impossible to use without the high-speed computer because they require complex analyses involving thousands of computations. Without very high computer
speeds the computations become awkward and cumbersome, if not impossible.
A second reason for high speed is that many computer users now work in time-sharing
systems. Timesharing is just what the expression says: many users work at terminals and
J. Kemeny, Man and the Computer. New York: Scribner's, 1972, p. 14.
Such remarkably fast computing times are obtained with larger computers. The typical microcomputer, if its memory is sufficient to accommodate large problems, takes much longer. For example, I had a well-known microcomputer generate 15 sets of 100 random digits 1 through 9, "bias" some of the digits, and then calculate means, standard deviations, sums of squares, and correlations among the 15 "variables." The calculations took well over an hour. Don't misjudge microcomputer speed, however. The microcomputer I used was not very fast. Yet even the one-hour time is fast compared to the computing time of earlier computers. For discussions of computer speed and its desirability, see Kemeny, op. cit., pp. 14ff.; C. Evans, The Micro Millennium. New York: Washington Square Press, 1979, pp. 54-57.
share the available time. (A terminal is a special keyboard with a screen, like a TV screen, both connected to a large computer either directly or indirectly by means of what is called a modem, which makes it possible to communicate with the computer over the telephone.) To make it possible for as many as a hundred users to use the same computer more or less at the same time, the computer must be very fast.
Let's compare time-sharing with what is called a batch system, which used to be the
major way researchers used the computer. This comparison may clarify the power of
time-sharing, a "direct"" interactive relationship with the computer, and the need for high
speed. In the batch system, computer users type (punch) their programs and data on cards
and submit decks of such cards to a clerk who puts the cards into a card reader. After the
cards are read, the program and data are stored on tape or disk, and the job is put on a
waiting list to be processed by the central processor. One job after another is processed by
the central computer in a serial fashion: it is, therefore, a serial system. No matter how
fast or how slow the computer is, the system works the same way: one job after another.
A time-sharing system, on the other hand, handles many jobs "simultaneously" and
rapidly. Generally speaking, each job is allocated a certain number of computer instruc-
tions for a given time unit. When a job's turn comes up, the computer performs the
number of operations that the job is allocated, stores the results, and goes on to the next
job. A job may not use its allocated time; in this case the time can be given to the next job.
For example, suppose I type in at my terminal the commands and data for my job, and the
computer accepts the job (after checking the legitimacy of my identification and time
allocation) and processes, say, n operations of my job. Now suppose you type in another
job at your terminal. The system starts processing your job after it has done the n opera-
tions of my job. It then processes, say, m operations of your job. It may then go on to
another job or return to my job. Suppose it returns to my job, finishes it, and stores it in
a "waiting" memory (waiting to be printed). Then it goes on to another job similarly.
Let us say that your job and my job are finished and in the "waiting" memory waiting
their turns to be printed —
or maybe aborted (if something was wrong with them). The
computer of course goes on to other jobs and similarly stores the results in memory. And
so it goes for as many as 100 jobs! For such a system to be successful, the computer must
be very, very fast and must be programmed efficiently. It is possible in a well-designed
time-sharing system for the computer to process 100 jobs in ten seconds! This description
is of course oversimplified; it serves, however, to show how it works.
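The round-robin logic just described can be caricatured in a few lines of code. The sketch below is only a toy model of the scheduling idea: each job receives a fixed allocation of "operations" per turn, and finished jobs go to a waiting (output) queue. The job names and sizes are, of course, hypothetical.

from collections import deque

jobs = deque([("mine", 250), ("yours", 120), ("another", 300)])
quantum = 100                  # operations allocated per turn (hypothetical)
finished = []                  # the "waiting to be printed" memory

while jobs:
    name, remaining = jobs.popleft()
    remaining -= quantum       # the computer performs up to `quantum` operations
    if remaining <= 0:
        finished.append(name)  # job done: store the results, go on to the next job
    else:
        jobs.append((name, remaining))  # not done: back to the end of the line

print(finished)  # completion order: ['yours', 'mine', 'another']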
Large Memory
A second basic characteristic of the modern computer is large memory. The processing of
verbal or numerical data requires large amounts of storage. A computer can work without
large memory, but it would be slow, inefficient, and perhaps inaccurate. When computers
were first made, computing was done piecemeal, so to speak. One phase of a job was
completed, and the results of this phase were fed into the computer (with cards or tape) for
further computing. These results might then be fed in for further computing. And so on.
It is extremely difficult to know clearly who is responsible for the basic conception of the modern computer as a self-contained, self-regulating, and general purpose machine. Two mathematical geniuses of the century, Alan Turing and John von Neumann, how-
"Most computer systems use another computer to "administer" and monitor the job process. For a
large
lively description and the rationale of the development of an actual time-sharing system at Dartmouth College by
one of its pioneers, see Kemeny. op. cit., chap. 3. Note that you can enter an analytic job in. say. half an hour
and, if you have made no errors, have the results printed on your own printer — assuming you have a printer and
it has been suitably set up with your terminal — in a few minutes! So can 1. If either of us has made one or more
errors, the system will abort the work and tell us what and where the errors are! Anyone who has spent years
using the batch system and first successfully uses the terminal learns the difference — dramatically. An authorita-
tive account of the original development of time-sharing at MIT is given by M. Denicoff. "Sophisticated
Software: The Road to Science and Utopia." in M. Dertouzos and J. Moses, eds.. The Computer Age: A
Twenty-Year View. Cambridge. Mass. MIT Press. 1979, pp. 370ff.
ever, saw that the piecemeal system described above was inadequate and formulated and helped implement the principle of the computer being able to do within itself all the procedures and processes necessary for computing. This means that when the computer
has finished, say, one phase of a job, it stores the results and goes on to the next comput-
ing phase. When it needs the stored results for further computations, it retrieves them
from memory. Thus was the modern idea of computer programs born! Turing and von
Neumann proposed that a set of instructions to the computer be stored in its memory so that
it can always "know" what to do. The program stored in the internal memory of the
computer, in other words, should be the computer director. Then the machine need not
pause for further instructions. The end result is a faster, more efficient, and virtually
error-free device. The idea, of course, is that computers have to be completely self-regulating: give the complete instructions for any procedure to the computer so that
they are stored internally, and so write them that the computer can continue to compute to
the end without interruption. If these instructions require "data" to operate on, store
these data, too, in the internal memory, or at least make the data easily available to the
computer. Obviously, computers need capacious memories.
Universality
A third basic characteristic of modern computers is universality. This means that computers are general purpose machines: they can be programmed to solve any problem that is solvable. The first computers were built for specific purposes, for example, to calculate the trajectories of artillery shells or to solve sets of equations. Such computers are limited. When the early computers were being developed, it was realized (by von Neumann primarily) that computers should be universal Turing machines. Turing had shown that a machine that could do a few basic operations could, in principle, do any calculations. What this boils down to is this: modern computers can in principle be instructed ("programmed") to do all kinds of operations and calculations. They are, in short, "general"
or "universal" machines.
Flexibility
Flexibility is really a characteristic of the use of the computer. What is meant is that there are several ways to write a program to accomplish a given end. Virtually identical results can be achieved with different instructions. That is, the way computers work permits
flexibility of programming. This means that you and I can write different programs to
calculate analysis of variance, say, and both our programs will yield the same results
(assuming, of course, that both programs are "correct," a very large assumption indeed).
Ductility
The last important characteristic of the computer to be discussed here is what I will call
ductility, which, loosely defined, means tractability or obedience. A computer will do
For von Neumann's thinking and work, see the fine book of Goldstine: H. Goldstine, The Computer: From Pascal to von Neumann. Princeton: Princeton University Press, 1972. Hodges has written an exhaustive biography of Turing with detailed accounts of his thinking on the computer and his influence on von Neumann (see footnote 6, below). Weizenbaum's explanation of Turing's ideas is detailed and clear (again, see footnote 6).
This is difficult to explain. Weizenbaum explains it in detail with examples: J. Weizenbaum, Computer Power and Human Reason: From Judgment to Calculation. San Francisco: Freeman, 1976, pp. 51ff. Turing was a member of a small group of mathematicians who during World War II succeeded in breaking the secret code the German high command used to transmit orders to field commanders, thus materially helping the British war effort. For a fascinating but sad account of Turing's brilliant thinking, work, and life, see A. Hodges, Alan Turing: The Enigma. New York: Simon and Schuster, 1983. von Neumann seems to have gotten his ideas for a modern computer from Turing.
precisely what it is told to do.
Despite this rather strong statement, it must be remembered that computers often produce wrong results for one or more of several reasons. If, for example, the input data are large numbers and sums of squares and cross products are computed, the accuracy limitations of the machine can be exceeded. Most computers have a double-precision feature that doubles accuracy (and is slower). A frequent source of errors is incorrectly typed (punched) input. Just think: one incorrect decimal can throw off the accuracy of a whole set of data. The moral is that all input and output should be checked for accuracy. Programs should have the options for users to print all input data, intermediate results, and output data. Then users should check, say, a random sample of five or ten percent of the input cases and calculate some of the intermediate and final results using another means of computation. Researchers must constantly guard against machine and human error.
This definition emphasizes quantitative data and numerical analysis because such data and analysis are most often encountered in behavioral research. Computer programs, however, are by no means limited to numerical analysis. Indeed, we have seen that the modern computer is a universal machine: it can in principle solve any problem that is solvable, numerical and nonnumerical. In earlier chapters we learned that content analysis focuses on the analysis of verbal materials. Modern word processors are computer programs that make it possible to type verbal materials into computer memories, recall and edit the materials, and print them on an external printer.
See the following book for authoritative discussions of the characteristics and the invention and development of computer languages from the beginning of such languages in the early 1950s: R. Wexelblatt, ed., History of Programming Languages. New York: Academic Press, 1981.
FORTRAN, among other things, uses several English commands like DO, GO TO,
READ, WRITE, CALL, CONTINUE, and IF. These commands mean what they say:
they tell the machine to do this, do that, go to this instruction, read that instruction, and
write (print) the outcome. The power and flexibility of this seemingly simple language
cannot be exaggerated. There is almost no numerical or logical operation that cannot be
accomplished with it.
Some computer experts say that the programming languages of the future are "struc-
tured" languages. PASCAL is a prominent and much-used structured language. Yet one
cannot be sure. Perlis, in his view of the future of computer languages, says, "FOR-
TRAN, glorious and persistent weed that it is, not only survives, but its users become
increasingly committed to it, even though theirs is a love-hate relationship." Because of
its worldwide use and availability, FORTRAN has become dominant, even though PAS-
CAL is thought to be a more satisfactory language. Perlis points out that even in China,
where one might expect independence of Western views and usage, FORTRAN is becom-
ing dominant! PASCAL is perhaps more difficult to learn and to use than FORTRAN and
much more difficult than BASIC. One of its great strengths is that it can handle both
quantitative and verbal materials. FORTRAN is strong in quantitative analysis but weak
in verbal analysis. We now turn to an easy computer language, BASIC, mainly because it is the common language of the microcomputer. It is an effective language.
'"This description omits consideration of the creation and manipulation of "files" and having FORTRAN or
other programs already in the computer memory can be readily used with a few commands.
so that the files
Another omission is the necessity of learning and using a special installation language for editing files. Suppose,
for instance, that you have a program you have written in BASIC to do analysis of variance and you want to add,
say. one or two formulas lo calculate indices of the magnitude of relations. An editing procedure is necessary to
do this. This procedure, also a program, is called into action, and you edit your program with its help. A/i/e is
any program or set of data, each with a name, that is in the computer memory and that you can invoke and use.
For information on the creation and editing of files at your installation, consult the appropriate people at the
installation.
"A. Perils, "Current Research Frontiers in Computer Science." in Dertouzos and Moses, op cii. pp.
422-436 (p. 427).
'"That BASIC will remain the language of the microcomputer is of course problematical. It is now. but this
may change. I doubt it because BASIC is not only easy to learn and use; it is also powerful enough for most
computing purposes in the behavioral sciences. And, as Kemeny points out. it is nicely suited to man-machine
interaction. Some computer experts deplore its use. but this may be due to specialist admiration for more
structured languages. See S. Papert, "Computers and Learning." in Dertouzos and Moses, op. cil.. pp. 73-86,
for a rather violent attack on BASIC. Papert atlnbutes the use of BASIC to the QWERTY phenomenon, the
persistence of the present typewriter keybo^d not because it has an adequate rationale but because it is widely
used and many people have learned it.
Package programs, however, do not always fit the demands of research problems. For example, an analysis of variance program may not
include relational indices such as omega-squared or the coefficient of intraclass correla-
tion. And most package programs do not include intermediary statistics in their outputs so
that one can calculate such indices. Moreover, it is easy for novices — even initiates — to use an analysis of variance program that has error terms (residual mean squares) that are
inappropriate for their problems. While professional programmers are usually highly
competent people, many of them do not know the reasons for many statistical procedures,
nor would it be economical for them to adapt package programs or to write new programs
for such special needs.
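As an illustration of the point, omega-squared can be computed by hand, or with a few lines of code, from quantities any one-way analysis of variance program does report. The sketch below uses the common one-way estimate, ω² = [SSb - (k - 1)MSw] / (SSt + MSw), with hypothetical values.

def omega_squared(ss_between, ss_total, ms_within, k):
    # Estimated omega-squared for a one-way analysis of variance:
    # (SSb - (k - 1) * MSw) / (SSt + MSw)
    return (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

# Hypothetical output from an ANOVA program: 3 groups.
print(round(omega_squared(ss_between=120.0, ss_total=520.0,
                          ms_within=8.0, k=3), 3))   # 0.197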
One of the most difficult problems associated with computer work, then, is communi-
cation between researcher and programmer. It is perhaps unrealistic to expect researchers
to be highly expert in programming. But it is even more unrealistic to expect professional
programmers to understand the substance and methodology of behavioral science analy-
sis. The best solution of the problem of communication between scientist and programmer is clear: the scientist must learn at least enough about programming to enable him to talk
knowledgeably and intelligently to the programmer. The researcher can learn to do this in
a matter of months, whereas it would take the programmer years to learn enough about
behavioral science and behavioral science analysis to communicate at the researcher's level.
One or two aspects of the computer's impact other than numerical analysis need to be
mentioned if the reader is to comprehend how far-reaching and deep the impact of the
computer has been and will continue to be. Perhaps the most important development is the
computer's versatility and applicability to research problems that require the analysis of
verbal materials. In Chapter 30, for instance, we found that content analysis can be a
highly useful approach to the measurement of complex and elusive variables. Conceptual
complexity itself is one such variable. With content analysis programs, such as those
discussed in Chapter 30 and no doubt increasingly available in the next decade as the rich
research possibilities are realized, all kinds of verbal materials can be effectively ana-
lyzed: children's stories and textbooks, editorials, projective materials, interview proto-
'"* Examples of the increasing use in psychology of content analysis have been mentioned in earlier chapters.
cols, essays, speeches, and so forth. Attitudes, motives, creativity, and historical and
philosophical trends are present possibilities, though still not well-developed research
preoccupations. The language specialist, the historian, the psychologist, the sociologist, and the political scientist can now analyze historical materials, literary productions, and political rec-
ords in a number of ways. Research like McClelland's pursuit of achievement and its
determinants in the present and the past can be extended and enhanced by using computer
methods of document search and analysis.'^
The tedious and time-consuming business of literature searches to find references
pertinent to a subject has been a scholarly imperative for centuries. It is now possible — indeed, necessary in view of the vast amounts of books and articles published in all fields — to have a computer search the literature and note references that are presumably
pertinent to a subject. The service is available in many university libraries for modest fees.
One can go to the university (or other) library and request that a search be made for refer-
ences on such-and-such a subject. The librarian will ask for several key words or expres-
sions so that the search can be made practicable. One cannot, for instance, just request
searches for studies of intelligence or achievement. One has to narrow requests to perti-
nent aspects of such broad subjects. Here is an example.
About five years ago, I needed a computer search for reference works on attitudes, using key words such as "bipolarity." The computer searched the literature that had been "banked" in computer memory — not all journals, of course, are banked — and printed out a list of books and
articles that had the key words in the titles. Even with the keyword limitation, the list was
large. I was asked whether I wanted summaries of any of the articles. I studied the list,
eliminated most of the references, and requested summaries of a few articles. Later I
requested full copies of five or six articles. The computer saved me weeks of library work,
most of it unfruitful. And note: the requests were made in Amsterdam, but the computer
search itself was done in California!
"I think that within four or five years some word-processor programs for microcomputers may include
sophisticated content analysis possibilities. The best microcomputers will have sufficient memories and speed to
handle content analysis. The real challenge will be researcher imagination and ingenuity.
""D. McClelland. Ttie Achieving Society. Princeton. N. J.: Van Nostrand. 1961. especially chap. 4. In a
study of drinking. McClelland used the General Inquirer computer-based system mentioned in Chapter 30 to
study folk-tale themes associated with drinking: D. McClelland etal.. "A Cross-Cultural Study of Folk Content
and Drinking," Sociometry. 29 (1966). 308-333.
"I am grateful to my former colleague. Dr. Harrie Vorst, for his expert help with the computer search, and
to the Faculty of the Social Sciences of the University of Amsterdam for making the search possible.
computer will "take over" and somehow hurt scholars and scholarly work. It is no doubt
true that misuse of computers can hurt scholars and their work. But most of the beliefs
about computers and computer destruction of human values is non.sense. The computer
can only get out of hand through ignorance, avoidance, and misuse. We are the masters.
We use the computer. And we must master and use it not because it is fashionable to do
so but because modem communication and research demand that we do so. If scholars
refuse to work with computers for whatever reason, then they abandon the important
policy determinations that shape the future and the future use of the powerful technology
that is transforming scholarly work and research. To leave such determinations in the
hands of computer specialists and technicians and academic administrators is to abandon a
large part of the scholarly and research enterprise.
In the only published account I have seen of a comparative assessment of the factor analysis programs of the three leading packages, SPSS, BMDP, and SAS, MacCallum found SAS the most satisfactory, BMDP the next most satisfactory, and SPSS the least satisfactory. Very significantly, MacCallum found that the programs were so constructed that the user can leave many decisions to the computer, a situation he rightly deplored. R. MacCallum, "A Comparison of Factor Analysis Programs in SPSS, BMDP, and SAS," Psychometrika, 48 (1983), 223-231. The journal editor added a note that this review was the first of a new type of paper to be published by the journal: evaluative descriptions of widely distributed computer programs. This is a commendable policy.
To put one's data into a factor analysis program and to let the program assume control of what happens and then to try to
extract meaning from the results is hardly appealing. This does not necessarily mean that
we are technical masters of the many ramifications of factor analysis, but it does mean that
we have a fair grasp of the technical principles behind the various forms of statistical
analysis, the mathematics of the general linear model, and, most important, that we are
committed to the principle that we ourselves analyze research data and draw conclusions
from the results. We do not put our whole dependence on computer experts, on graduate
student assistants, and on computer package programs. This is based on the principle that
analysis is part of the research problem and the research design and plan to obtain answers
to the problem. It is not something farmed out to technicians and assistants and the
computer itself.
This position is evidently at considerable variance with much actual practice. It is thus
controversial. MacCallum, in the review of package programs cited above, was forced to
conclude that the factor analysis program of one of the packages he studied could not be
recommended to users. He emphasized that the programs of the computer packages he
reviewed were so constructed that the user can leave many important choices of methods
to the program. His point is well-taken.
Etzioni has pointed out the possible deleterious effects of microcomputers on scientific practice. He deplored the increasing trend to do scientific work "by what is in effect, a
trial-and-error search, rather than a focused effort. Such a development would be a latter-
day repeat performance of the impact of the introduction of prepackaged computer pro-
grams on some branches of the social sciences." He points out that finding new variables to study group differences requires considerable intellectual, not mechanical, effort — less use of prepackaged programs and more of scientific creativity. What he means, I think, is
that the ready availability and ease of use of microcomputers makes it very easy for the
researcher to plunge into analysis without having thoroughly studied problems and their
implications and ramifications. Finally, Etzioni stresses the need for training graduate
students to recognize the danger of letting the computer set the direction of their work and
the need for reflection and thought. To this excellent advice I would add the need for
behavioral science and education faculties to recognize the danger and, even more impor-
tant, the need for adequate study and learning of both faculty and graduate students of the
research methodology necessary to do adequate scientific behavioral research.
A HORTATORY CONCLUSION
Sermons are bores. More important, they probably have little effect except to produce
yawns and exasperation. In providing students of research with recommendations on how
best to use the computer, one is certainly inclined to deliver a sermon because the fascina-
tion and power of these remarkable machines have a seductive influence that is extremely
hard to resist and that can lead to marked lowering of scientific criteria and standards. This
is part of the power that Weizenbaum talked about, especially when he described hackers
and hacking. It is what Etzioni talked about when he deplored the danger of mindless use
of computers.
Suppose one agrees that there is a danger to science, scientists, and research; what do
students do both to use the computer and to avoid the dangers inherent in its use? How can
we avoid the danger Etzioni stressed of losing the essential core of science and research
through trial-and-error use of the computer rather than the use of focused intellectual
effort on research problems? I risk the boredom of sermonizing by suggesting two or three
things one can do.
"MacCallum {ibid. p. 230) is referring to the defaults built into large computer programs So when users do
. .
not really know much about, say, factor analysis or multiple regression, they can let the program "decide" how
to analyze the data —
by default.
A. Etzioni, "Effects of Small Computers on Scientists," Science, 189 (1975), editorial.
One and most important, remember that the basic purpose of behavioral science and
research is psychological and sociological theory, explanation of human behavior. Any-
thing that interferes with the pursuit of theory is a threat to science. I won't belabor how
the computer can be such a threat since I've already done so earlier.
Two, behavioral researchers should learn at least one computer language, perhaps
FORTRAN because it is used everywhere and is always available (maybe not a good
argument!), or BASIC because it is easily learned (again, maybe not a good argument!), it is quite adequate for statistical analysis, and it is, at least at present and in the immediate
future, the language of the microcomputer. Psychologically, the successful writing of
programs is an enormous boost to one's scientific morale, so often beaten down in our
highly technical, budget- and market-oriented surroundings. Yet it is not only writing
programs; computer language is one of the most important keys to mastering the computer
and to solving analytic problems. For example, if one knows FORTRAN or BASIC, one
can build quite powerful programs by using the subroutines provided by various sources. One writes input and output statements and calls the subroutines as needed.
Three, total dependence on statisticians and computer specialists is unwise. Behav-
ioral researchers have to have sufficient methodological competence to do most things
themselves. Technical people are indispensable, but they usually know little about science
and research. The behavioral scientist, in other words, has to know enough of statistical
analysis and computer technology to be able to use technicians as resources rather than as
preceptors. The microcomputer can help greatly in developing self-sufficiency and com-
petence in analysis —
if we are constantly alert to the seductiveness of any powerful
mechanical tool. It's so easy to lose oneself in hacking! And so hard to hew to research
problems, to understand the difficult demands of measurement, and to master the intrica-
cies of statistical analysis!
Computers, then, are extremely useful, obedient, and reliable servants, but one must
always remember that their facile output can never substitute for competent and imagina-
tive theoretical, research design, and analytic thinking. Despite the dangers, however, the
reader is urged to explore and learn to use this enormously fascinating and powerful
analytic tool. One thing is certain: researchers who learn a little FORTRAN or BASIC and
who put two or three programs through a machine complex successfully will never again
be the same. They have participated in one of the most interesting and exciting adventures
they will ever experience. The main problem will then be to maintain the balance and
discretion to keep the machine where it belongs: in the background and not the foreground
of research activity.
ADDENDUM
Users of computers, large and small, should know something of their absorbing history
and evolution. The following references are suggested for their interest and their quality.
Bradshaw, G., Langley, P., and Simon, H. "Studying Scientific Discovery by Computer Simulation," Science, 222 (1983), 971-975. An account of a remarkable achievement: the successful simulation of scientific discovery itself.
Branscomb, L. "Information: The Ultimate Frontier," Science, 203 (1979), 143-147. Inter-
esting and disturbing futuristic essay on the influence of the computer: no more letter post,
printing with jet inks (already with us), and so on.
Davis, R. "Evolution of Computers and Computing," Science, 195 (1977), 1096-1102. Very
good essay on the history of computers and computing.
For example, D. McCracken, A Guide to FORTRAN IV Programming, 2d ed. New York: Wiley, 1972. In this excellent guide, McCracken provides several valuable FORTRAN routines. Ruckdeschel has published BASIC subroutines that can be useful in building analysis programs: F. Ruckdeschel, BASIC Scientific Subroutines. Peterborough, N.H.: Byte/McGraw-Hill, 1981. See also J. Kemeny and T. Kurtz, BASIC Programming, 3d ed. New York: Wiley, 1980.
Dertouzos, M., and Moses, J., eds. The Computer Age: A Twenty-Year View. Cambridge, Mass.: MIT Press, 1979.
The student who wants to learn a computer language will of course need help. Here are
good guides to BASIC and FORTRAN.
Kemeny, J., and Kurtz, T. BASIC Programming. 3d ed. New York: Wiley, 1980. An excellent manual by the two main inventors of BASIC. Has examples of useful programs.
McCracken, D. A Guide to FORTRAN IV Programming. 2d ed. New York: Wiley, 1972. Standard and excellent guide to FORTRAN. Has useful examples.
A highly useful tool for behavioral researchers is the programmable hand-held calculator-
computer. Advanced models make the calculation of many statistical procedures easily
possible. Programs can be written and recorded on small plastic slides, which can be used
again and again.
It is well to bear in mind always that many analyses do not require a large computer;
they can easily be done with a desk calculator. Indeed, a useful precept might be: Don't
use the large computer unless you have to. Most analyses of variance and multiple regres-
sion analyses can be done with a programmable calculator or a microcomputer. In less
than half a decade, many or most behavioral researchers will have their own microcom-
puters, and good statistical and mathematical programs will be readily available. Readers
are cautioned, however, not to accept and buy too easily sets of statistical programs for
microcomputers. They may be untested —
and expensive. During the next half decade,
too, reviews of package programs for both large and small computers will appear in the
better journals. Until you are fairly sure of the adequacy of a program, use it with circum-
spection and care — or not at all.
Appendix C

Random Numbers and Statistics
This appendix contains 4,000 random numbers organized in 40 sets of 100 each. The numbers are whole numbers evenly distributed in the range 1 through 100. The appendix
has three purposes: to supply random numbers and statistics for the text and for the study
suggestions of earlier chapters; to give readers at least some more-or-less direct experi-
ence with random numbers; and to demonstrate randomness with simple statistics. To
achieve the third purpose, basic statistics calculated from the 40 sets of numbers, treating
the sets as variables, are also given below: means, variances, standard deviations, and the
intercorrelations of the 40 "variables."
Random numbers, or rather pseudorandom numbers, can be generated in a number of
ways. One can take the square roots of numbers to several decimal places and extract the
middle numbers from each number. Or one can copy the numbers produced by the spins
of a roulette wheel. Probably the best way is to use the computer and an addition or
multiplication process to produce large numbers and then take parts of these numbers as
random numbers. The 4000 numbers given below were produced by such a method.
I am grateful to Mr. Edward Friedman, Associate Research Scientist, Computing Center, Courant Institute of Mathematical Sciences, New York University, for his help with the program that generated the random numbers given in this appendix. The actual random number computer program used as a subroutine was RANFNYU, a CIMS Computing Center routine.
Oddly enough, numbers generated by a computer are generated with a completely determined calculation. They are thus called pseudorandom numbers.
Green and Lohnes and Cooley discuss the method, which is called the power residue
method.
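The power residue method is a multiplicative congruential scheme: each new number is the previous one multiplied by a fixed constant, modulo a large number, with the residue scaled to the desired range. Here is a minimal sketch in Python; the multiplier and modulus are the widely used Lewis-Goodman-Miller constants, an illustrative assumption, not the constants actually used in RANFNYU:

    # Power residue (multiplicative congruential) generation.
    # The constants are illustrative, not those of RANFNYU.
    M = 2**31 - 1          # a large prime modulus
    A = 16807              # the multiplier, 7**5

    def power_residue(seed, n):
        """Return n pseudorandom whole numbers in the range 0 through 100."""
        x = seed
        numbers = []
        for _ in range(n):
            x = (A * x) % M                   # the power residue step
            numbers.append(int(101 * x / M))  # scale the residue to 0-100
        return numbers

    print(power_residue(seed=12345, n=10))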
RANCAL, the random number computer program used, generates k sets of N random numbers each; k and N are read into the computer. In this case k = 40 and N = 100. The program also calculates the means and standard deviations of each of the 40 sets as well as the mean and standard deviation of all 4000 numbers. The statistics are given below.
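RANCAL itself is not reproduced here, but its summary computations are simple. A sketch of that step in Python, with Python's own pseudorandom generator standing in for RANFNYU:

    # Means and standard deviations of each set and of all numbers pooled,
    # as RANCAL reports them. A sketch, not RANCAL itself.
    import random
    from math import sqrt

    def mean_sd(xs):
        m = sum(xs) / len(xs)
        return m, sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

    random.seed(1)  # arbitrary seed
    sets = [[random.randint(0, 100) for _ in range(100)] for _ in range(40)]
    for i, s in enumerate(sets, start=1):
        m, sd = mean_sd(s)
        print(f"set {i:2d}: mean = {m:.2f}, sd = {sd:.2f}")
    pooled = [x for s in sets for x in s]
    m, sd = mean_sd(pooled)
    print(f"all 4000: mean = {m:.2f}, sd = {sd:.2f}")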
Since the random properties of the numbers were discussed in Chapter 12, they need not be discussed here. It is, of course, possible to test the randomness of the numbers in a number of ways. One can count the frequencies of odd and even numbers, or the frequencies of any arbitrarily defined groups of numbers, such as 0-9, 10-19, etc., and then do chi-square analysis to test the significance of departures from chance expectations. A sketch of such a check follows.
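Here is one way the group-frequency check might be written in Python (an illustration, not part of RANCAL). The numbers are binned into the groups 0-9, 10-19, ..., 90-100; note that with 101 possible values the last group holds 11 of them:

    # Chi-square check of group frequencies for one set of 100 numbers.
    import random

    def chi_square_groups(numbers):
        observed = [0] * 10
        for x in numbers:
            observed[min(x // 10, 9)] += 1   # 100 falls in the last group
        n = len(numbers)
        # Nine groups of 10 possible values each; one group (90-100) of 11.
        expected = [n * 10 / 101] * 9 + [n * 11 / 101]
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    random.seed(1)  # arbitrary seed
    print(chi_square_groups([random.randint(0, 100) for _ in range(100)]))
    # With 10 - 1 = 9 degrees of freedom, the .05 critical value of
    # chi-square is 16.92; a larger statistic signals nonrandomness.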
We now describe a more interesting test. RANCAL calculates the intercorrelations of the k (= 40) sets of random numbers. Since the numbers are presumably random, the correlations among the 40 sets should hover around zero, with occasional r's in the .10's and .20's, but rarely in the .30's (plus and minus). With N = 100, an r of .197 is significant at the .05 level and an r of .256 at the .01 level. The number of r's equal to or greater than .197 and the number of r's equal to or greater than .256 were counted. Since 5 percent of the total number of r's [k(k - 1)/2 = (40)(39)/2 = 780] is 39 (780 × .05), we can expect to find about 39 r's equal to or greater than .197. Similarly, 1 percent of the 780 r's, or about 8, can be expected to be equal to or greater than .256.
To provide a better test, three additional samples of 4000 numbers each were generated with RANCAL, and the r's counted as above. The results of counting the r's in the four samples are given in Table C.1. The departures from chance expectations are not significant (by chi-square test). Most of the r's hover around zero. The highest r of the 4 × 780 = 3120 r's is -.35. The numbers appear to be random by this test. It would be profitable for the student to make up other tests and use them on the data given below; a sketch of one such replication follows. The importance of understanding and gaining experience with random processes, random numbers, and Monte Carlo methods cannot be overemphasized.
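In that Monte Carlo spirit, the whole counting experiment can be replicated in a few lines of Python, with Python's own pseudorandom generator standing in for RANFNYU and an arbitrary seed:

    # A Monte Carlo replication of the counting experiment above.
    import random
    from math import sqrt

    def pearson_r(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / sqrt(sxx * syy)

    random.seed(2)  # arbitrary seed
    sets = [[random.randint(0, 100) for _ in range(100)] for _ in range(40)]
    rs = [pearson_r(sets[i], sets[j])
          for i in range(40) for j in range(i + 1, 40)]  # 780 r's

    print(sum(abs(r) >= .197 for r in rs), "of 780 r's at or beyond .197")
    print(sum(abs(r) >= .256 for r in rs), "of 780 r's at or beyond .256")
    # Chance expectation: about 39 and about 8, respectively.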
RANDOM NUMBERS
[Table of 4000 random numbers, 40 sets of 100 each, not reproduced.]

STATISTICS

Set      Mean       Variance
10 49.2800 872.3016
11 48.8700 777.5731
12 53.0800 860.2136
13 56.5100 773.0099
14 47.9900 1110.2299
15 49.3700 913.6531
16 49.0200 714.0396
17 45.6800 842.0776
18 47.0400 853.3384
19 53.5100 977.2499
20 52.7400 853.4924
21 50.0600 1001.1564
22 53.9500 907.7475
23 53.6100 737.3779
24 49.3100 807.4139
25 49.1600 673.9544
26 50.2200 855.3316
27 58.3600 877.1904
28 49.5700 709.7051
29 55.4400 868.3664
30 49.4300 791.3851
31 48.5200 847.9296
32 52.9400 802.3564
33 46.7900 784.6259
34 48.3300 881.4611
35 47.2900 759.3059
36 55.5100 854.5499
37 52.3900 907.8379
38 49.9500 851.3275
39 46.0000 817.7800
40 47.6500 815.1475
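Assuming the numbers are the 101 whole numbers 0 through 100, each equally likely, the theoretical mean is 50 and the theoretical variance 850; the set means and variances above do hover around just these values. A short verification:

    # Theoretical mean and variance of the discrete uniform distribution
    # on the 101 whole numbers 0, 1, ..., 100.
    values = range(101)
    mean = sum(values) / 101
    variance = sum((v - mean) ** 2 for v in values) / 101
    print(mean, variance)   # 50.0 850.0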
Appendix D
The Research Report
THIS APPENDIX has two purposes: to outline some of the main points of report writing and
to cite appropriate references to guide the reader.
THE PURPOSE
The purpose of the research report is to tell readers the problem investigated, the methods
used to solve the problem, the results of the investigation, and the conclusions inferred
from the results. It is not the function of the investigator to convince the reader of the
virtue of the research. Rather, it is to report, as expeditiously and clearly as possible, what
was done, why it was done, the outcome of the doing, and the investigator's conclusions.
The report should be so written that readers can reach their own conclusions as to the
adequacy of the research and the validity of the reported results and conclusions.
To achieve this purpose is not easy. The writer must strive for the right blend of detail
and brevity, for objectivity, and for clarity in presentation. Perhaps the best criterion
question is: Can another investigator replicate the research by following the research
report? If not, due to incomplete or inadequate reporting of methodology or to lack of
clarity in presentation, then the report is inadequate.¹
¹The realities of publishing and its costs limit the above statement. Book publishers and journal editors do not have the space available to make it possible to publish enough details of research studies so that they can be replicated. Indeed, the constraints on editors are such that they can hardly publish sufficient details of research for readers to make informed judgments of the methodological adequacy of the studies. Nevertheless, the criterion question should always be kept in mind.
THE STRUCTURE
The structure of the research report is simple. It is almost the same as the structure of the
research itself: the problem, the methodology, the results. Here is a general outline:
I. Problem
   1. Theory, hypotheses, definitions
   2. Previous research; the literature
II. Methodology-Data Collection
   1. Sample and sampling method
   2. How hypotheses were tested (methodology), experimental procedures, instrumentation
   3. Measurement of variables
   4. Methods of analysis, statistics
   5. Pretesting and pilot studies
III. Results, Interpretation, and Conclusions
THE PROBLEM
The problem section differs greatly in different reports. In theses and books, it is usually long and detailed. In published research reports, it is kept to a minimum (see footnote 1). The basic precept, though seemingly obvious, is not easy to follow: Tell readers what the research problem is. Tell it to them in question form. For example: What effect does an equalized extrinsic environment have on the mental status of school children?² Does past experience with materials have a negative effect on problem solving involving the materials?³ How do social attitudes influence judgments of the effectiveness of social policies?⁴
The statement of the general problem is usually not precise and operational. Rather, it sets the general stage for the reader. The subproblems, however, should be more precise. They should have implications for testing. For example: Can a person conversing with others manipulate conversation by agreeing or disagreeing with the others, or by paraphrasing what they have said?⁵ The Jones and Cook statement given in the preceding paragraph is made more operational by specifying the social attitudes and the judgments of social policies affecting blacks: Do attitudes toward blacks affect recommendations of social policies for improving black welfare? Do individuals with positive attitudes toward blacks recommend societal change, and do individuals with negative attitudes recommend that blacks improve themselves?
Some report writers, rather than state the problems, state the general and specific
hypotheses. A good practice would seem to be to state the broader general problem and
then to state the hypotheses, both general and specific. The reader is referred to Chapter 2
for examples. Whatever way is used, bear in mind the main purpose of informing the
reader of the main area of investigation and the specific propositions that were tested.
At some point in the problem discussion the variables should be defined, or at least
mentioned or generally characterized, with more specific definitions given later. Variable
definition was discussed at length in Chapter 3 and need not be repeated here, except for
the admonition: Inform the reader not only of the variables but also of what you mean by
them. Define in general and operational terms, giving justification for your definitions.
²A. Firkowska, A. Ostrowska, M. Sokolowska, Z. Stein, M. Susser, and I. Wald, "Cognitive Development and Social Policy," Science, 200 (1978), 1357-1362. (Note again that my problem statements are often, perhaps usually, different from the authors'.)
³H. Birch and H. Rabinowitz, "The Negative Effect of Previous Experience on Productive Thinking," Journal of Experimental Psychology, 41 (1951), 121-125.
⁴S. Jones and S. Cook, "The Influence of Attitude on Judgments of the Effectiveness of Social Policy," Journal of Personality and Social Psychology, 32 (1975), 767-773.
⁵W. Verplanck, "The Control of the Content of Conversation: Reinforcement of Statements of Opinion," Journal of Abnormal and Social Psychology, 51 (1955), 668-676. The statement is an operational expression of reinforcement theory.
There are two main reasons for discussing the general and research literature related to the research problem. The first of these is the more important: to explain and clarify the theoretical rationale of the problem. Suppose, like Haslerud and Meyers, one were interested in investigating the relative effectiveness for transfer of self-discovery of principles by learners and systematic enunciation of the principles to learners.⁶ Since the problem is in part a transfer-of-training problem, one would have to discuss transfer and some of the literature on transfer, but especially that part of the literature pertinent to this problem. One may well want to discuss to some extent philosophical and pedagogical writings on the theory of formal discipline, for instance. In this manner the investigator provides a general picture of the research topic and fits his problem into the general picture.
A second reason for discussing the literature is to tell the reader what research has and
has not been done on the problem. Obviously, the investigator must show that his particu-
lar investigation has not been done before. The underlying purpose, of course, is to locate
the present research in the existing body of research on the subject and to point out what it
contributes to the subject.
Methodology-Data Collection
The function of the methodology-data collection section of the research report, of course, is to tell the reader what was done to solve the problem. Meticulous care must be exercised to report in such a way that the criterion of replicability is satisfied. That is, it should be possible for another investigator to reproduce the research, to reanalyze the data, or to arrive at unambiguous conclusions as to the adequacy of the methods and data collection.
In books and theses there can be little question of the applicability of the criterion. In
research journal reports, unfortunately, the criterion is difficult, sometimes even impossi-
ble, to satisfy. Owing to lack of journal space, investigators are forced to condense reports
in such a way that it is sometimes difficult to reconstruct and evaluate what a researcher
has done. (See footnote 1.) Yet the criterion remains a good one and should be kept in
mind when tackling the methodology section.
The first part of the methodology-data collection section should tell what sample or
samples were used, how they were selected, and why they were so selected. If eighth-
grade pupils were used, the reason for using them should be stated. If the samples were
randomly selected, this should be said. The method of random sampling should also be
specified. If pupils were assigned at random to experimental groups, this should be re-
ported. If they were not, this, too, should be reported with reasons for the lack of such
assignment.
The method of testing the hypotheses should be reported in detail. If the study has
been experimental, the manner in which the independent variable(s) has been manipulated
is described. This description includes instruments used, instructions to the subjects,
control precautions, and the like. If the study has been nonexperimental, the procedures
used to gather data are outlined.
The report of any empirical study must include an account of the measurement of the
variables of the study. This may be accomplished in a few sentences in some studies. For
example, in an experiment with one independent variable and a dependent variable whose
measurement is simple, all that may be necessary is a brief description of the measurement
of the dependent variable. Such measurement may entail only the counting of responses.
In other studies, the description of the measurement of the variables may take up most of
the methodology section. A factor analytic study, for instance, may require lengthy de-
scriptions of measurement instruments and how they were used. Such descriptions will, of
course, include justification of the instruments used, as well as evidence of their reliability
and validity.
An account of the data analysis methods used is sometimes put into the methodology section.

⁶G. Haslerud and S. Meyers, "The Transfer Value of Given and Individually Derived Principles," Journal of Educational Psychology, 49 (1958), 293-298.

RESULTS, INTERPRETATION, AND CONCLUSIONS
This part of the report, though logically a unit, is often broken down into two or three
sections. We treat it here as one section, since the interpretation of results and the conclu-
sions drawn from the results are so often reported together in journal research reports. In
a thesis or book, however, it may be desirable to separate the data from their interpretation
and from the conclusions.
The results or data of a research study are the raw materials for the solution of the
research problem. The data and their analysis are the hypothesis-testing stuff of research.
Methodology and data collection are tools used to obtain the raw material of hypothesis-
testing, the data. The main question is this: Do the data support or not support the
hypotheses? It cannot be emphasized enough that methodology, data collection, and anal-
ysis are selected and used for the purpose of testing the operational hypotheses deduced
from the general research questions. Therefore the report writer must be exceptionally
careful to report his results as accurately and completely as possible, informing the reader
how the results bear on the hypotheses.
Before writing this part of the report, it is helpful to reduce the data and the results of
the data analysis to condensed form, particularly tables. The data should be thoroughly
digested and understood before writing. The question, Do the data support the
hypotheses? must be clearly answered before writing the results section. While writ-
ing, one must be constantly on guard against wandering from the task at hand, the solution
of the research problem. Everything written must be geared to letting the data bear on the
problem and the hypotheses.
Somewhere in the final section of the research report the limitations and weaknesses of
the study should be discussed. This can be overdone, of course. All scientific work has
weaknesses, and many pages can be written belaboring a study's weaknesses. Still, the
major limitations, which, of course, may have been mentioned earlier when discussing the
⁷In the report of an experimental phase of a larger complex study, the writer and a colleague evidently lost sight of this precept. A severe but perspicacious critic who read the final report noted that the basic hypothesis of the experiments had not really been tested. Unfortunately, the critic was right. The experimental project had to be scrapped! For an account of the research and the project's demise, see F. Kerlinger, Liberalism and Conservatism: The Nature and Structure of Social Attitudes. Hillsdale, N.J.: Lawrence Erlbaum Associates, 1984, chap. 12.
problem or the methodology, should be pointed out. This is done, not to show humility or
one's technical competence, but rather to enable the reader to judge the validity of the
conclusions drawn from the data and the general worth of the study.
Limitations of social scientific and educational research generally come from sam-
pling and subject assignment inadequacies, methodological weaknesses, and statistical
deficiencies. Lack of random sampling, as we have seen, limits the conclusions to the
particular sample used. Lack of random assignment casts doubt on the adequacy of the
control of independent variables and thus on the conclusions. Statistical deficiencies,
similarly, can lead to incorrect conclusions. Deficiencies in measurement always affect
conclusions, too. If a measurement instrument, perhaps through no fault of the writer, is
only moderately reliable, a finding may be ambiguous and inconclusive. More important,
the questionable validity of an instrument may seriously affect conclusions.
These matters have been discussed in the text and need no further elaboration here. It
may be added, however, that the writing of the conclusions is naturally affected by the
recognized and acknowledged limitations and weaknesses. Moreover, readers can hardly
be expected to judge the validity of research conclusions without knowing both the posi-
tive and the negative aspects of whatever was done. It is the professional responsibility of
the researcher, then, to inform readers of both the strengths and the weaknesses of the
research.
THE WRITING
It is not easy to write simply and clearly. One has to work at it. One should realize that almost no writer can escape the necessity of constant revision by reorganizing and paring, deleting circumlocutions, redundancies, and other verbal fat. Suggestions for better research report writing follow.
Although research reports should be fairly detailed, there is no need to waste words.
State the problem, the methodology, and the results as clearly, simply, and briefly as
possible. Avoid hackneyed expressions like "in terms of," "with respect to," "with
reference to," "give consideration to," and the like. Delete unnecessary words and
expressions when revising. For example, sentences with expressions like "the fact of the
matter is," "owing to the fact that," and "as to whether" can always be revised to
remove such clumsy inelegancies. For good advice on simplicity and clarity, study Strunk
and White's little classic, The Elements of Style. Nicholson's book is most helpful.⁸
Writing scholarly papers and research reports requires a certain amount of routine
drudgery that few of us like. Bibliographies, footnotes, tables, figures, and other mechan-
ical details, however, cannot be escaped. Yet a little systematic study can help solve most
problems. That is, do not wait until you sit down to write and then find out how to handle
footnotes and other mechanical details. Get a good reference book or two and study and
lay out footnote and bibliographical forms, tables, figures, and headings. Put three or four
types of footnote entries on 3-by-5 cards. Similarly, learn two or three methods of laying
out tables. Lay out skeleton tables. Then use these samples when writing. In short, put
much of the drudgery and doubt behind you by mastering the elements of the methods,
instead of impeding your writing by constant interruptions to check on how to do things.
Presentation of statistical results and analyses gives students considerable trouble. Hit
the problem head-on. Perhaps the best way to do this is to study statistical presentation in
two or three good journals, like the Journal of Personality and Social Psychology, the
Journal of Educational Psychology, the American Educational Research Journal, and the
American Sociological Review. The style manual of the American Psychological Associa-
tion (see References) has been adopted by all psychology journals and a number of
education journals. Although a bit fussy, it is an excellent guide, especially to statistical
and tabular presentation. Turabian's manual is another good guide.
The purpose of statistical, tabular, and other condensed presentation should be kept in
⁸See the references at the end of this appendix.
mind. A statistical table, for instance, should clearly tell the reader what the data essentially say. This does not mean, of course, that a statistical table can stand by itself. Its purpose is to illuminate and clarify the textual discussion. The text carries the story; the table helps make the text clear and gives the statistical evidence for assertions made in the text. The text may say, for example, "The three experimental groups differed significantly in achievement," and the tables will report the statistical data (means, standard deviations, F ratios, levels of significance) to support the assertion. There is often no
need for a table. If a hypothesis has been tested by calculating one, two, or three coeffi-
cients of correlation, these can simply be reported in the text without tabular presentation.
A fairly safe generalization to guide one in writing research reports is: first drafts are
not adequate. In other words, almost any writing, as said earlier, improves upon revision.
It is almost always possible to simplify first-draft language and to delete unnecessary
words, phrases, and even sentences and paragraphs. A first rule, then, is to go over any
report with a ruthless pencil toward the end of greater simplicity, clarity, and brevity.
With experience this not only becomes possible; it becomes easier.
If an adequate outline has been used, there should be little problem with the organiza-
tion of a research paper. Yet sometimes it is necessary to reorganize a report. One may
find, for example, that one has discussed something at the end of the report that was not
anticipated in the beginning. Reorganization is required. In any case, the possibility of
improvement in communication through reorganization should always be kept in mind.
Anyone's research writing can be improved in two ways: by letting something one has
written sit for a few weeks, and by having someone else read and criticize one's work. It
is remarkable what a little time will do for one's objectivity and critical capacity. One sees
obvious blemishes that somehow one could not see before. Time helps salve the ego, too.
Our precious inventions do not seem so precious after a few weeks or months. We can be
much more objective about them.
The second problem is harder. It is hard to take criticism, but the researcher must learn
to take it. Scientific research is one of the most complex of human activities. Writing
research reports is not easy, and no one can be expected to be perfect. It should be
accepted and routine procedure, therefore, to have colleagues read our reports. It should
be accepted routine, too, to accept our readers' criticisms in the spirit in which we should
have asked for them. There is of course no obligation to change a manuscript in line with
criticism. But there is an obligation to give each criticism the serious, careful, and objec-
tive attention it deserves. Doctoral students have to consider seriously the criticisms of
their sponsors, whether or not they like them or agree with them. All scholarly and
scientific writers, however, should voluntarily learn the discipline of subjecting their work
to their peers. They should learn that the complex business of communicating scholarly
and scientific work is difficult and demanding, and that in the long run they can only profit
from competent criticism and careful revision.
Schmid, C. Statistical Graphics: Design Principles and Practices. New York: Wiley, 1983. In writing reports, we have paid too little attention to the graphic presentation of statistical data. This book discusses and illustrates the principles of good graphic procedures.
Strunk, W., and White, E. The Elements of Style. 3rd ed. New York: Macmillan, 1979. This little gem, which every writer should own, is dedicated to clarity, brevity, and simplicity.
Turabian, K. A Manual for Writers of Term Papers, Theses, and Dissertations. 4th ed. Chicago: University of Chicago Press, 1973. An excellent, invaluable reference. Can well be called the handbook of the doctoral student. It is based on the Manual of Style.
Name Index

[Name index (authors with text page references) not reproduced.]
Subject Index

[Subject index (topics with text page references) not reproduced.]