Wickens Ch2 Research Methods
An Introduction
to Human Factors
Engineering
Second Edition
Christopher D. Wickens
University of Illinois at Urbana-Champaign
John Lee
University of Iowa
Yili Liu
University of Michigan
Often there are so few funds available for answering human factors research
questions, or the time available for such answers is so short, that it is impossible
to address the many questions that need asking in applied research designs. As a
consequence, there is a need to conduct more basic laboratory research, which is
less expensive and less risky, or to draw conclusions from other researchers who have pub-
lished their findings in journals and books. These research studies may not have
exactly duplicated the conditions of interest to the human factors designer. But
if the findings are strong and reliable, they may provide useful guidance in ad-
dressing that design problem, informing the designer or applied researcher, for
example, of the driving conditions that might make cellular phone use more or
less distracting; or the extent of the benefits that could be gained by a voice-dialed
phone over a hand-dialed one.
Step 5. Draw conclusions. Based on the results of the statistical analysis, the re-
searchers draw conclusions about the cause-and-effect relationships in the ex-
periment. At the simplest level, this means determining whether hypotheses
were supported. In applied research, it is often important to go beyond the obvi-
ous. For example, our study might conclude that shiftwork schedules affect
older workers more than younger workers, or that they influence the performance
of certain tasks and not others. Clearly, the conclusions that we draw depend a
lot on the experimental design. It is also important for the researcher to go be-
yond concluding what was found, to ask "why". For example, are older people
more disrupted by shiftwork changes because they need more sleep? Or because
their natural circadian rhythms (day-night cycles) are more rigid? Identifying underly-
ing reasons, whether psychological or physiological, allows for the development
of useful and generalizable principles and guidelines.
Experimental Designs
For any experiment, there are different designs that can be used to collect the data.
Which design is best depends on the particular situation. Major features that differ
between designs include whether each independent variable has two levels or
more, whether one or more independent variables are manipulated, and whether the
same or different subjects participate in the different conditions defined by the in-
dependent variables (Keppel, 1992; Elmes et al., 1995; Williges, 1995).
The Two-Group Design. In a two-group design, one independent variable or fac-
tor is tested with only two conditions or levels of the independent variable. In
the classic two-group design, a control group gets no treatment (e.g., driving
with no cellular phone), and the experimental group gets some "amount" of the
independent variable (e.g., driving while using a cellular phone). The dependent
variable (driving performance) is compared for the two groups. However, in
human factors we often compare two different experimental treatment condi-
tions, such as performance using a trackball versus using a mouse. In these cases,
a control group is unnecessary: A control group to compare with mouse and
trackball users would have no cursor control at all, which does not make sense.
Multiple Group Designs. Sometimes the two-group design does not adequately
test our hypothesis of interest. For example, if we want to assess the effects of
VDT brightness on display perception, we might want to evaluate several differ-
ent levels of brightness. We would be studying one independent variable
(brightness) but would want to evaluate many levels of the variable. If we used
five different brightness levels and therefore five groups, we would still be study-
ing one independent variable but would gain more information than if we used
only two levels/groups. With this design, we could develop a quantitative model
or equation that predicts performance as a function of brightness. In a different
multilevel design, we might want to test four different input devices for cursor
control, such as trackball, thumbwheel, traditional mouse, and key-mouse. We
would have four different experimental conditions but still only one indepen-
dent variable (type of input device).
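As a sketch of how such a quantitative model might be built, the following Python fragment fits an equation predicting performance from brightness. The brightness levels, scores, and log-linear model form are all hypothetical assumptions for illustration, not data from any actual study.

```python
# Sketch: fitting a quantitative model of performance as a function of
# brightness, from a hypothetical five-level, one-factor design.
import numpy as np

brightness = np.array([10, 20, 40, 80, 160])            # cd/m^2, hypothetical levels
mean_score = np.array([62.0, 70.5, 78.2, 83.1, 84.9])   # mean perception score per group

# Fit a log-linear model: score = intercept + slope * ln(brightness).
slope, intercept = np.polyfit(np.log(brightness), mean_score, deg=1)
print(f"score ~ {intercept:.1f} + {slope:.1f} * ln(brightness)")
```

With five levels rather than two, the fitted equation can interpolate performance at brightness values that were never tested, which a two-group design cannot do.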
Factorial Designs. In addition to increasing the number of levels used for ma-
nipulating a single independent variable, we can expand the two-group design
by evaluating more than one independent variable or factor in a single experi-
ment. In human factors, we are often interested in complex systems and there-
fore in simultaneous relationships between many variables rather than just two.
As noted above, we may wish to determine if shiftwork schedules (Factor A)
have the same or different effects on older versus younger workers (Factor B).
A multifactor design that evaluates two or more independent variables by
combining the different levels of each independent variable is called a factorial
design. The term factorial indicates that all possible combinations of the inde-
pendent variable levels are evaluated. Factorial designs allow the
researcher to assess the effect of each independent variable by itself and also to
assess how the independent variables interact with one another. Because much
of human performance is complex and human-machine interaction is often
complex, factorial designs are the most common research designs used in both
basic and applied human factors research.
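To make "all possible combinations" concrete, here is a minimal Python sketch that enumerates the cells of a factorial design. The factor names and levels are illustrative, following the shiftwork example above.

```python
# Sketch: enumerating the cells of a 2 x 2 factorial design.
from itertools import product

shiftwork = ["rotating", "fixed"]    # Factor A: shiftwork schedule
age_group = ["younger", "older"]     # Factor B: worker age

# A full factorial crosses every level of A with every level of B: 2 x 2 = 4 cells.
for schedule, age in product(shiftwork, age_group):
    print(f"condition: schedule={schedule}, age={age}")
```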
Factorial designs can be more complex than a 2 X 2 design in a number of
ways. First, there can be more than two levels of each independent variable.
For example, we could compare driving performance with two different cellu-
lar phone designs (e.g., hand-dialed and voice-dialed), and also with a "no
phone" control condition. Then we might combine that first three-level vari-
able with a second variable consisting of two different driving conditions: city
and freeway driving. This would result in a 3 X 2 factorial design. Another way
that factorial designs can become more complex is by increasing the number
of factors or independent variables. Suppose we repeated the above 3 X 2 de-
sign with both older and younger drivers. This would create a 3 X 2 X 2 design.
FIGURE 2.1
The four experimental conditions for a 2 X 2 factorial design. (Factors: cellular phone use X driving conditions.)
The pattern of cell means suggests that phone use impairs driving only in heavy traffic conditions [as defined in this partic-
ular study]. When the lines connecting the cell means in a factorial
study are not parallel, as in Figure 2.2, we know that there is some
type of interaction between the independent variables: The effect of
phone use depends on driving conditions. Factorial designs are popular
for both basic research and applied questions because they allow re-
searchers to evaluate interactions between variables.
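A minimal sketch of how an interaction shows up in the cell means of a 2 X 2 design. The numbers are hypothetical, and "lateral deviation" is an assumed driving-performance measure, not data from the study described above.

```python
# Sketch: detecting an interaction from the cell means of a 2 x 2 design.
cell_means = {  # hypothetical lateral deviation (m) per condition
    ("phone", "light"): 0.31, ("phone", "heavy"): 0.62,
    ("no phone", "light"): 0.30, ("no phone", "heavy"): 0.35,
}

# Simple effect of phone use within each driving condition.
for traffic in ("light", "heavy"):
    effect = cell_means[("phone", traffic)] - cell_means[("no phone", traffic)]
    print(f"{traffic} traffic: phone effect = {effect:+.2f}")

# Unequal simple effects (non-parallel lines in the plot) signal an
# interaction: here phone use matters mainly in heavy traffic.
```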
FIGURE 2.2
Interaction between cellular phone use and driving conditions. (Line graph of cell means; x-axis: light traffic vs. heavy traffic; separate lines for car phone and no car phone.)
Between-subjects designs are most commonly used when having subjects perform in more than
one of the conditions would be problematic. For example, if you have subjects
receive one type of training (e.g., on a simulator), they could not begin over
again for another type of training because they would already know the mater-
ial. Between-subjects designs also eliminate certain confounds related to order
effects, which we discuss shortly.
To control for order effects in within-subjects designs, researchers use a technique termed counterbalancing. This simply means that different sub-
jects receive the treatment conditions in different orders. For example, half of
the participants in a study would use a trackball and then a mouse. The other
half would use a mouse and then a trackball. There are specific techniques for
counterbalancing order effects; the most common is a Latin-square design. Re-
search methods books (e.g., Keppel, 1992) provide instruction on using these
designs.
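As an illustration, the following Python sketch builds a simple cyclic Latin square of presentation orders for the four input devices mentioned earlier. Note that fully counterbalancing first-order carryover effects requires the balanced Latin-square constructions described in texts such as Keppel (1992); this is only the basic cyclic form.

```python
# Sketch: a cyclic Latin square for counterbalancing condition order.
def latin_square(conditions):
    """Cyclic Latin square: row i starts at condition i and wraps around."""
    n = len(conditions)
    return [[conditions[(i + j) % n] for j in range(n)] for i in range(n)]

# Each of four subject groups receives one row as its presentation order;
# every device appears exactly once in each ordinal position.
for order in latin_square(["trackball", "thumbwheel", "mouse", "key-mouse"]):
    print(" -> ".join(order))
```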
In summary, the researcher must control extraneous variables by making
sure they do not covary with the independent variable. If they do covary, they
become confounds and make interpretation of the data impossible. This is be-
cause the researcher does not know which variable caused the differences in the
dependent variable.
Data Analysis
Once the experimental data have been collected, the researcher must determine
whether the dependent variable(s) actually did change as a function of experi-
mental condition. For example, was driving performance really "worse" while
using a cellular phone? To evaluate the research questions and hypotheses, the
experimenter calculates two types of statistics: descriptive and inferential statis-
tics. Descriptive statistics are a way to summarize the dependent variable for the
different treatment conditions, while inferential statistics tell us the likelihood
that any differences between our experimental groups are "real" and not just
random fluctuations due to chance.
Descriptive Statistics. Differences between experimental groups are usually de-
scribed in terms of averages. Thus, the most common descriptive statistic is the
mean. Research reports typically describe the mean scores on the dependent
variable for each group of subjects (e.g., see the data shown in Table 2.1 and
Figure 2.2). This is a simple way of conveying the effects of the independent
variable(s) on the dependent variable. Standard deviations are also sometimes
given to convey the spread of scores.
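A minimal sketch of these descriptive statistics in Python, using hypothetical lateral-deviation scores for two groups (the numbers and measure are assumptions for illustration):

```python
# Sketch: mean and standard deviation for two treatment groups.
import statistics

phone    = [0.58, 0.71, 0.63, 0.66, 0.69]   # hypothetical scores per subject
no_phone = [0.33, 0.29, 0.38, 0.31, 0.35]

for name, scores in (("phone", phone), ("no phone", no_phone)):
    print(f"{name}: mean = {statistics.mean(scores):.2f}, "
          f"sd = {statistics.stdev(scores):.2f}")
```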
Inferential Statistics. While experimental groups may show different means
for the various conditions, it is possible that such differences occurred solely on
the basis of chance. Humans almost always show random variation in perfor-
22 Chapter 2: Research Methods
mance, even without manipulating any variables. It is not uncommon to get two
groups of subjects who have different means on a variable, without the differ-
ence being due to any experimental manipulation, in the same way that you are
likely to get a different number of "heads" if you do two series of 10 coin tosses.
In fact, it is unusual to obtain means that are exactly the same. So, the question
becomes, Is the difference big enough that we can rule out chance and assume
the independent variable had an effect? Inferential statistics give us, effectively,
the probability that the difference between the groups is due to chance. If we can
rule out the "chance" explanation, then we infer that the difference was due to
the experimental manipulation.
For a two-group design, the inferential statistical test usually used is a t test.
For more than two groups, we use an analysis of variance (ANOVA). Both tests
yield a score; for a t test, we get a value for a statistical term called t, and for
ANOVA, we get a value for F. Most important, we also identify the probability, p,
that the t or F value would be found by chance for that particular set of data if
there was no effect or difference. The smaller the p value is, the more signifi-
cant our result becomes and the more confident we are that our independent
variable really did cause the difference. This p value will be smaller as the difference
between the group means grows larger relative to the variability within the groups.
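A sketch of both tests using SciPy, with the same hypothetical data as above plus a third (assumed) voice-dialed group; in practice, the raw score for each subject, not just the group means, feeds the test:

```python
# Sketch: a t test for two groups and a one-way ANOVA for three groups.
from scipy import stats

phone    = [0.58, 0.71, 0.63, 0.66, 0.69]   # hypothetical scores per subject
no_phone = [0.33, 0.29, 0.38, 0.31, 0.35]
voice    = [0.41, 0.45, 0.39, 0.48, 0.44]

# Two-group design: independent-samples t test.
t, p = stats.ttest_ind(phone, no_phone)
print(f"t = {t:.2f}, p = {p:.4f}")

# Three or more groups: one-way analysis of variance (ANOVA).
f, p = stats.f_oneway(phone, no_phone, voice)
print(f"F = {f:.2f}, p = {p:.4f}")
```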
Drawing Conclusions
Researchers usually assume that if p is less than .05, they can conclude that the
results are not due to chance and therefore that there was an effect of the inde-
pendent variable. Accidentally concluding that independent or causal variables
had an effect when it was really just chance is referred to as making a Type I
error. If scientists use a .05 cutoff, they will make a Type I error only one time in
20. In traditional sciences, a Type I error is considered a "bad thing" (Wickens,
1998). This makes sense if a researcher is trying to develop a cause-and-effect
model of the physical or social world. The Type I error would lead to the devel-
opment of false theories.
Researchers in human factors have also accepted this implicit assumption
that making a Type I error is bad. Research where the data result in inferential
statistics with p > .05 is not generally accepted for publication in most journals.
Experimenters studying the effects of system design alternatives often conclude
that the alternatives made no difference. Program evaluations in which introduc-
tion of a new program resulted in statistics of p > .05 often conclude that the
new program did not work, all because there is greater than a 1-in-20 chance
that spurious factors could have caused the results.
The cost of setting this arbitrary cutoff of p = .05 is that researchers are
more likely to make Type II errors, concluding that the experimental manipula-
tion did not have an effect when in fact it did (Keppel, 1992). This means, for
example, that a safety officer might conclude that a new piece of equipment is
no easier to use under adverse environmental conditions, when in fact it is eas-
ier. The likelihoods of making Type I and Type II errors are inversely related.
Thus, if the experimenter found that the new equipment was not significantly
better than the old at the .05 level, the new equipment might be rejected even
though it might actually be better; had the criterion been set at .10 instead of
.05, the equipment might have been concluded to be better.
The total dependence of researchers on the p < .05 criterion is especially
problematic in human factors because we frequently must conduct experiments
and evaluations with relatively low numbers of subjects because of expense or
the limited availability of certain highly trained professionals (Wickens, 1998).
As we saw, using a small number of subjects makes the statistical test less power-
ful and more likely to show no significance, or p > .05, even when there is a dif-
ference. In addition, the variability in performance between different subjects or
for the same subject but over time and conditions is also likely to be great when
we try to do our research in more applied environments, where all confounding
extraneous variables are harder to control. Again, these factors make it more
likely that the results will show no significance, or p > .05. The result is that
human factors researchers frequently conclude that there is no difference in ex-
perimental conditions simply because there is more than a 1-in-20 chance that it
could be caused by random variation in the data.
In human factors, researchers should consider the probability of a Type II
error when their difference is not significant at the conventional .05 level and
consider the consequences if others use their research to conclude that there is no
difference (Wickens, 1998). For example, will a safety-enhancing device fail to be
adopted? In the cellular phone study, suppose that performance really was worse
with cell phones than without, but the difference was not quite big enough to
reach .05 significance. Might the legislature conclude, in error, that cell phone use
was "safe"? There is no easy answer to the question of how to balance Type I and
Type II statistical errors (Keppel, 1992; Nickerson, 2001). The best advice is to re-
alize that the larger the sample size, the less likely either type of error is to occur, and to
consider the consequences of both types of errors when, out of necessity, the sam-
ple size and power of the design of a human factors experiment must be low.
However, especially for applied research, we must look at the difference between
the two groups in terms of practical significance. Is it worth spending millions to
place simulators on every military base to get an increase from 80 percent to 83
percent? This illustrates the tendency for some researchers to place too much
emphasis on statistical significance and not enough emphasis on practical sig-
nificance.
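The sample-size point above can be made concrete with a small Monte Carlo sketch. Assuming a true effect of moderate size exists (Cohen's d = 0.5, an assumption chosen purely for illustration), it estimates how often a t test at the .05 level misses that effect for various group sizes.

```python
# Sketch: how small samples inflate the Type II error rate of a t test.
import random
from scipy import stats

def type_ii_rate(n, d=0.5, trials=2000):
    """Fraction of simulated experiments that miss a real effect of size d."""
    misses = 0
    for _ in range(trials):
        a = [random.gauss(0.0, 1.0) for _ in range(n)]  # control group
        b = [random.gauss(d, 1.0) for _ in range(n)]    # group with real effect
        _, p = stats.ttest_ind(a, b)
        if p > .05:
            misses += 1        # real difference declared "not significant"
    return misses / trials

for n in (8, 20, 64):
    print(f"n = {n:3d} per group: Type II error rate ~ {type_ii_rate(n):.2f}")
```

With only eight subjects per group, the test misses the real effect most of the time; the miss rate shrinks steadily as the sample grows.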
DESCRIPTIVE METHODS
While experimentation in a well-controlled environment is valuable for uncov-
ering basic laws and principles, there are often cases where research is better
conducted in the real world. In many respects, the use of complex tasks in a real-
world environment results in more generalizable data that capture more of the
characteristics of that environment. Unfortunately, conducting
research in real-world settings often means that we must give up the "true" ex-
perimental design because we cannot directly manipulate and control variables.
One example is descriptive research, where researchers simply measure a number
of variables and evaluate how they are related to one another. Examples of this
type of research include evaluating the driving behavior of local residents at var-
ious intersections, measuring how people use a particular design of ATM (auto-
matic teller machine), and observing workers in a manufacturing plant to
identify the types and frequencies of unsafe behavior.
Observation
In many instances, human factors research consists of recording behavior during
tasks performed under a variety of circumstances. For example, we might install
video recorders in cars (with the drivers' permission) to film the circumstances in
which they place or receive calls on a cellular phone during their daily driving.
In planning observational studies, a researcher identifies the variables to be
measured, the methods to be employed for observing and recording each vari-
able, the conditions under which observation will occur, the observational time
frame, and so forth. For our cellular phone study, we would develop a series of
"vehicle status categories" to which to assign each phone use (e.g., vehicle
stopped, during turn, city street, freeway). These categories define a
taxonomy. Without such a taxonomy, observation will result in a large number of specific pieces
of information that cannot be reduced to any meaningful descriptions or con-
clusions. It is usually most convenient to develop a taxonomy based on pilot
data. This way, an observer can use a checklist to record and classify each in-
stance of new information, condensing the information as it is collected.
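A minimal sketch of such condensation in Python; the category labels follow the vehicle-status example above, and the individual observations are hypothetical.

```python
# Sketch: tallying observed phone uses into a vehicle-status taxonomy.
from collections import Counter

observations = ["vehicle stopped", "freeway", "city street", "freeway",
                "during turn", "freeway", "vehicle stopped", "city street"]

# Counter condenses many specific observations into category frequencies.
for category, count in Counter(observations).most_common():
    print(f"{category:16s} {count}")
```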
In situations where a great deal of data is available, it may be more sensible
to sample only a part of the behavioral data available or to sample behavior dur-
ing different sessions rather than all at once. For example, a safety officer is bet-
ter off sampling the prevalence of improper procedures or risk-taking behavior
on the shop floor during several different sessions over a period of time than all
at once during one day. The goal is to get representative samples of behavior,
Descriptive Methods 25
and this is more easily accomplished by sampling over different days and during
different conditions.
A correlation between two variables means that they covary, such that the value
of one can be somewhat predicted by knowing the value of the other. In a
positive correlation, one variable increases as the value of another variable
increases; for example, the amount of illumination needed
to read text will be positively correlated with age. In a negative correlation, the
value of one variable decreases as the other variable increases; for example,
sensitivity to soft, just-audible tones is negatively correlated with age. By
calculating the correlation coefficient, r, we get a measure of the strength of the
relationship. Statistical tests can be performed that determine the probability
that the relationship is due to chance fluctuation in the variables. Thus, we get
information concerning whether a relationship exists (p) and a measure of the
strength of the relationship (r). As with other statistical measures, the likelihood
of finding a significant correlation increases as the sample size (the number of
items measured on both variables) increases.
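A sketch of the computation with SciPy, using hypothetical illumination-and-age data patterned on the positive-correlation example above:

```python
# Sketch: computing the correlation coefficient r and its p value.
from scipy import stats

age = [22, 31, 45, 52, 60, 68, 74]           # reader age, years (hypothetical)
lux = [310, 340, 420, 480, 520, 590, 640]    # illumination needed, lux (hypothetical)

r, p = stats.pearsonr(age, lux)
print(f"r = {r:.2f}, p = {p:.4f}")   # positive r: older readers need more light
```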
One caution should be noted. When we find a statistically significant corre-
lation, it is tempting to assume that one of the variables caused the changes seen
in the other variable. This causal inference is unfounded for two reasons. First,
the direction of causation could actually be in the opposite direction. For exam-
ple, we might find that years on the job is negatively correlated with risk-taking.
While it is possible that staying on the job makes an employee more cautious, it
is also possible that being more cautious results in a lower likelihood of injury or
death. This may therefore cause people to stay on the job. Second, a third vari-
able might cause changes in both variables. For example, people who try hard to
do a good job may be encouraged to stay on and may also behave more cau-
tiously as part of trying hard.
In modeling research, human performance is mathematically modeled and coded into a runnable simulation program. Various scenarios
are run, and the model shows what would happen to the system. The predictions
of a simulation can be validated against actual human performance (time, er-
rors, workload). This gives future researchers a powerful tool for predicting the
effects of design changes without having to do experiments. One important ad-
vantage of using models for research is that they can replace evaluation using
human subjects to assess the impact of harmful environmental conditions (Kan-
towitz, 1992; Moroney, 1994).
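As one illustration of the approach, the following sketch uses keystroke-level-style operator times to predict task times for two dialing designs. The specific time values are approximations, and the "speak_digit" operator and both dialing sequences are assumptions for illustration, not a model from the literature cited above.

```python
# Sketch: a simple operator-time model predicting dialing task times.
OPERATOR_TIME = {"keystroke": 0.28, "mental": 1.35, "speak_digit": 0.40}  # seconds

def predict(sequence):
    """Predicted task time: the sum of the modeled operator times."""
    return sum(OPERATOR_TIME[op] for op in sequence)

hand_dialed  = ["mental"] + ["keystroke"] * 10     # recall number, press 10 keys
voice_dialed = ["mental"] + ["speak_digit"] * 10   # recall number, speak 10 digits

print(f"hand-dialed:  {predict(hand_dialed):.1f} s")
print(f"voice-dialed: {predict(voice_dialed):.1f} s")
```

Predictions like these can then be validated against observed times, errors, and workload, as described above.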
Literature Surveys
A final research method that should be considered is the careful literature search
and survey. While this often precedes an experimental write-up, a good litera-
ture search can often substitute for the experiment itself if other researchers
have already answered the experimental question. One particular form of litera-
ture survey, known as a meta-analysis, can integrate the statistical findings of
many experiments that have examined a common independent variable in
order to draw a collective and highly reliable conclusion regarding the effect of
that variable (Rosenthal & Reynard, 1991).
ETHICAL ISSUES
It is evident that the majority of human factors research involves the use of peo-
ple as participants. Many professional associations and government
agencies have written specific guidelines for the proper way to involve partici-
pants in research. Federal agencies rely strongly on the guidelines found in the
Code of Federal Regulations, Title 45, Part 46: Protection of Human Sub-
jects (Department of Health and Human Services, 1991). The National Institutes
of Health has a Web site where students can be certified in human subjects test-
ing (http://ohsr.od.nih.gov/cbtl). Anyone who conducts research using human
participants should become familiar with the federal guidelines as well as the
APA's published guidelines for the ethical treatment of human subjects (American Psy-
chological Association, 1992). These guidelines fundamentally advocate the fol-
lowing principles: