Roger Bakeman
Georgia State University
Vicenç Quera
Universidad de Barcelona
Cambridge University Press
Contents
Selected-Interval Recording 34
Live Observation versus Recorded Behavior 35
Digital Recording and Computer-Assisted Coding 37
Summary 40
Epilogue 163
Appendix A: Expected Values for Kappa Comparing Two Observers 165
Appendix B: Expected Values for Kappa Comparing with a Gold Standard 167
References 169
Index 179
Preface
We wrote this book because it’s time. The TLA (three-letter acronym) for
because it’s time is BIT, and what used to be called the bit-net (now the
Internet) let the authors begin their long-distance collaboration between
Atlanta and Barcelona. When we began working together in the early 1990s,
many investigators believed – with some justification – that observational
methods were appealing but too expensive and too time-consuming. At
that time, analog video recording on tape had replaced film, and electronic
means of recording observational data were replacing paper and pencil; yet
most electronic and computer systems were specialized, expensive, and a
bit cumbersome. We knew the digital revolution had begun, but we had no
idea it would have the reach and impact it has today.
As we begin the second decade of this century, times have indeed changed.
We now live in an image-saturated world where no moment seems private
and everything seems available for instant download. Thus it is no wonder
that researchers increasingly see merit in digitally recording behavior for sub-
sequent systematic observation. Indeed, for recording behavior, digital has
become the standard and preferred method. And although the systematic
observation of the sort described in this book can still be done live, it works
far better when behavior is digitally recorded for later replay, reflection, and
review. Digital multimedia (audio-video) files can be created, copied, played,
and stored with relative ease – and increasingly at minimal expense.
Coding behavior for subsequent quantitative analysis has likewise been
transformed by the digital revolution. Computer-assisted coding programs
remove much of the tedium and potential for error from the coding task –
and can even make coding fun. Once such programs were a bit exotic, few in
number, and required relatively expensive equipment. Now – given digital
multimedia files€– such programs are easier to implement, and the kind of
computer capability they require has become ubiquitous and inexpensive.
As a consequence, users have more choices than formerly, and some soft-
ware has become less expensive or even free.
Spurred by the advent of digital recording and coding and by their
greater ease and accessibility, we think it is time to revisit matters first dis-
cussed in our 1995 book, Analyzing Interaction: Sequential Analysis with
SDIS and GSEQ. In the early 1990s – recognizing the power of standard
formats such as those underlying almost everything the Internet touches –
we defined a standard set of conventions for sequential observational
data: the Sequential Data Interchange Standard, or SDIS. We then wrote
a general-purpose computer program for analyzing sequential observa-
tional data that relied on those standards: the General Sequential Querier,
or GSEQ. Our 1995 book had described how to run this program in the
dominant computer system of the day; that system (the Disk Operating
System, or DOS) is now essentially extinct, and the book is out of print.
GSEQ, however, has now been updated to run in the Windows environ-
ment (the current version is available at www.gsu.edu/~psyrab/gseq or
www.ub.edu/gcai/gseq).
The present book differs from our 1995 book in several ways. Primarily,
it is greatly expanded in scope: it focuses on observational methods gen-
erally and is not confined to the details of GSEQ. It also offers consid-
erable practical advice regarding sequential analysis and data analytic
strategies for sequential observational data – advice that applies whether
or not GSEQ is used. At the same time, we have striven to write a relatively
brief and nonencyclopedic book that is characterized by straightforward,
reader-friendly prose. Here, the interested reader may still learn how to
use GSEQ effectively with sequential observational data, if desired, but
should also be able to gain a sound conceptual overview of observational
methods – a view grounded in the contemporary digital world.
It is the grounding in the digital world and its explication of GSEQ capa-
bilities that most distinguish this volume from the book Roger Bakeman
wrote with John Gottman, Observing Interaction: An Introduction to
Sequential Analysis (1st ed. 1986, 2nd ed. 1997). Granted some conceptual
overlap, the topics covered in the two volumes are sufficiently different that
Observing Interaction can easily be read with profit as a companion to this
one. Certainly the intended audience is the same.
The audience we have in mind consists of behavioral and social science
researchers, of whatever level, who think observational methods might be
useful and who want to know more about them, or who have some famil-
iarity with observational methods and want to further hone their skills and
understanding. Apart from an interest in behavioral research, we assume
that readers of this volume will be familiar with research methods and stat-
istical analysis, at least at the level presented in introductory courses in these
topics. Such knowledge may not be needed for the first chapter – which is
intended as a basic introduction to observational methods generally (and
which more knowledgeable readers may skim) – but is required for subse-
quent chapters.
As with our 1995 book, many people have helped us in our task. One
author, Roger Bakeman (RB), recognizes the debt owed his graduate school
advisor, Robert L. Helmreich, who first encouraged him to learn more about
observational methods, and his debt to Gene P. Sackett, who introduced him
to sequential analysis. For RB, those interests were honed in collaborative
work at Georgia State University, beginning first in the 1970s with Josephine
V. Brown, a lifelong friend; and continuing since the 1980s with Lauren
B. Adamson, an invaluable friend, supporter, and research partner. More
recently, Augusto Gnisci of the Second University of Naples and Eugene H.
Buder and D. Kimbrough Oller of the University of Memphis have helped
us improve GSEQ, our computer program for sequential analysis. Eugene
H. Buder also offered many thoughtful and cogent suggestions for improv-
ing an earlier draft; we appreciate his contribution to the clarity of the final
volume, while taking responsibility for any murkiness that remains. The
other author, Vicenç Quera (VQ), recognizes the debt owed the late Jordi
Sabater-Pi, who transmitted his enthusiasm for naturalistic research to VQ
and first taught him how to observe and analyze behavior systematically;
and his debt to his early mentor, colleague, and friend, Rafael López-Feal,
who supported and encouraged his teaching and research. RB would also
like to acknowledge Maria Teresa Anguera, who translated Bakeman and
Gottman (1986) into Spanish, invited RB to speak at the University of
Barcelona in 1991, and introduced us. Our collaboration began immedi-
ately and has now continued through almost two decades.
As is always the case, colleagues and friends – too many to mention –
have contributed to our thinking and work over the years. RB would like to
thank, in particular, Daryl W. Nenstiel, who – in addition to being a lifelong
critic and partner – attempted to improve the prose of the current volume
(any remaining flaws, of course, remain ours), and Kenneth D. Clark, who
manages to keep RB on target and humble. VQ would like to thank Esther
Estany, who from time to time manages to extract him from writing papers
and computer code to visit distant deserts and other exotic regions, and
his colleagues from the Adaptive Behavior and Interaction Research Group
at the University of Barcelona for sharing good and bad academic times and
for their irreplaceable friendship and collaboration.
1
Introduction to Observational Methods
and informed narratives have a long and important history in such fields as
history, journalism, and anthropology, and what are usually called qualitative
methods have contributed to a number of fields in the behavioral sciences
(see Cooper et al., 2012). Moreover, as we describe in the next chapter, quali-
tative methods do play a role when developing the coding schemes used for
systematic observation. For example, Marjorie Shostak’s Nisa: The Life and
Words of a !Kung Woman (1981) provides an excellent example of qualitative
methods at work. In it she organizes interviews around such themes as earli-
est memories, discovering sex, first birth, and motherhood and loss; and she
provides texture, nuance, and insight that would largely elude quantitative
approaches. Another classic example is Barker and Wright’s (1951) One Boy’s
Day: A Specimen Record of Behavior, which provides intimate and poignant
minute-by-minute, morning-to-night observations of one boy’s life during a
single mid-twentieth-century Kansas day.
In contrast, as we understand the term, observational methods for behav-
ior are unabashedly quantitative. They provide measurement. Measurement
is usually understood as the act of assigning numbers or labels to things
(Walford, Tucker, & Viswanathan, 2010). In principle, the thing measured
could be any discrete behavioral entity. In observational practice, that entity
is typically an event or a time interval within which events can occur (see
Chapter 3). As you will see in subsequent chapters, event is a key term –
we apply it to both relatively instantaneous behaviors and behaviors that
have appreciable duration. Some authors – for example, Altmann (1974) –
reserve the term for more momentary behaviors and use state for behaviors
of greater duration.
Measurement implies a measuring instrument: A thermometer gauges a
person’s temperature, a scale a person’s weight. For systematic observation
of behavior, the measuring instrument consists of coding schemes – which
we discuss at length in the next chapter€– used by trained observers. As you
will see, unlike more familiar measuring devices, coding schemes are more
conceptual. They are based on mental distinctions and not on physical
materials like thermometers and rulers, and they involve a human compo-
nent (i.e., the observers). Melvin Konner’s work (e.g., 1976) with Harvard’s
study of the !Kung in Botswana in the late 1960s and early 1970s provides an
example. An electronic device delivered a click to his ear every 15 �seconds.
He then recorded which of several mother, infant, adult, and child behav-
iors defined by his coding scheme had occurred since the last click. One
result of his work was a quantitative description of how often others in the
environment (e.g., mothers, fathers, other adults, siblings, other children)
paid attention to and played with !Kung infants.
correlational. Perhaps for this reason, students sometimes think that obser-
vational methods are inherently correlational, but this is not so. True, many
experimental studies are performed in laboratories and behavioral observa-
tions are often employed in field settings not involving manipulation. But
correlational studies can also be performed in laboratories and experimen-
tal ones in the field; and behavioral observations can be employed for either
type of study in either setting. It is the design that makes a study correl-
ational or experimental, not the measurement technique used.
a few minutes to several hours. More typically, sessions are recorded, which
can absorb even more time as observers spend hours coding just a few min-
utes of behavior. Compared to the few items of a typical self-report meas-
ure, data collected from observation can be voluminous and their analysis
seemingly intractable. Why then would an investigator bother with such a
time-consuming method?
their work, awareness of the cameras seemingly disappeared within the first
several minutes of their two- to three-week stay in the habitat.
The third reason is that when investigators are interested in process – how
things work and not just outcomes – observational methods have the ability
to capture behavior unfolding in time (which is essential to understanding
process) in a way that more static measures do not. An important feature of
behavior is its functionality: What happens before? What next? Which are
causes and which consequences? Are there lagged effects between certain
behaviors? Only by studying behavior as a process can investigators address
such questions. A good example is Gottman’s work on marital interaction
(1979), which, based on characterizations of moment-to-moment inter-
action sequences, predicted whether relationships would dissolve or not.
Also, process questions almost always concern contingency. For example,
when nurses reassure children undergoing a painful procedure, is the chil-
dren’s distress lessened? Or, when children are distressed, do nurses reassure
them more? In fact, contingency analyses designed to answer questions like
these may be one of the more common and useful applications of observa-
tional methods (for details, see Chapters 9 and 11).
Code – Definition
Unoccupied – Child not engaged with anything specific; seems aimless.
Onlooker – Child watches other children playing, but does not join in.
Solitary or independent play – Child plays alone and independently,
seemingly unaffected by others.
Parallel activity – Child plays independently beside, but not with, other
children but with similar toys; no attempt to control who is in the group.
Associative play – Child plays with other children, with some sharing of
play materials and mild attempts to control who is in the group.
Cooperative play – Child plays in a group that is organized for some
purpose, for example, playing house or a formal game or to attain a goal.
Figure 1.2. The evolution of three similar coding schemes for social participation
as discussed in the text (adapted from Bakeman & Gottman, 1997). The figure
aligns Parten's (1932) Unoccupied, Onlooker, Solitary, Parallel, Associative, and
Cooperative codes; Smith's (1978) Alone (spanning Parten's Unoccupied through
Solitary), Parallel, and Group (spanning Associative and Cooperative); and
Bakeman & Brownlee's (1980) Together, Unoccupied, Solitary, Parallel, and Group.
and then compared the actual counts to those expected by chance, based
on how often each type of play state occurred. Of particular interest was
the Parallel-to-Group transition, which Bakeman and Brownlee thought
should be especially frequent if parallel play serves as a bridge to group
play. By chance alone, observed values for this transition would exceed their
chance values for only about half of the children. In fact, values for the Parallel-to-
Group transition exceeded chance for thirty-two of the forty-two children
(p < .01, two-tailed sign test; see the section titled “The sign test: A
non-parametric alternative” in Chapter 11). Given this result, Bakeman
and Brownlee concluded that movement from parallel to group play may
be more a matter of moments than of months, and that parallel play may
indeed serve as a bridge – a brief interlude during which young children
“size up” those to whom they are proximal before deciding whether to
become involved in the group activity.
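For readers who want to verify such a result, the two-tailed sign test is easy
to compute. Here is a minimal Python sketch (our illustration, not part of
the original analysis) that doubles the smaller tail of a binomial distribution
with p = .5:

    from math import comb

    def sign_test_two_tailed(successes, n):
        """Two-tailed sign test: double the smaller tail of Binomial(n, .5)."""
        k = min(successes, n - successes)
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * tail)

    # 32 of 42 children exceeded chance for the Parallel-to-Group transition.
    print(sign_test_two_tailed(32, 42))  # roughly .001, hence p < .01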
This sequence of three studies helps define what we mean, not just by
observational methods but by sequential analysis. All employed observa-
tional measurement – that is, observers applied predefined coding schemes
to events observed either live or recorded. And all used the categorical data
recorded initially to compute various summary scores, which were then
analyzed further. Parten used her cross-sectional data to suggest develop-
mental progressions over years, Smith used his longitudinal data to suggest
developmental progressions over months, but of the three, only Bakeman
and Brownlee used continuously recorded data to suggest how behavior was
sequenced over moments. True, the sequential analytic methods described
in this book can be applied at larger time scales (as Smith demonstrates),
but almost always the time scales we have in mind involve seconds and
minutes – only rarely days, months, or years – because the more immediate
time scales seem appropriate for studying social processes that happen in
the moment.
summary
The topic of this book is systematic observation – observational methods,
generally – which is one of several useful methods for measuring behavior.
We say systematic to distinguish quantitative observational methods from
other approaches to understanding behavior that rely less on quantifica-
tion and more on qualitative description. Quantification is provided by
measurement, which usually is understood as the act of assigning numbers
or labels to things. Consequently, coding schemes, as detailed in the next
chapter, are central to observational methods. Measurement requires an
instrument.
2
Coding Schemes and Observational Measurement
behavior of interest repeatedly (video recordings help greatly) with the eye
of a qualitative researcher. Try to identify themes and name codes. Then try
to imagine how analysis of these codes will help you later when you attempt
to answer your research questions.
In any case, developing coding schemes is necessarily an iterative pro-
cess, a matter of repeated trial and error and successive refinement. Whether
you begin with coding schemes others have used or start de novo, you will
pilot-test each successive version against examples of behavior (here video
recordings help greatly). Such checking may reveal that codes that seemed
important initially simply do not occur, so you will remove them from your
list. It may also reveal that distinctions that seemed important theoretically
cannot be made reliably, in which case you will probably define a single,
lumped code that avoids the unreliable distinction. Or you may find ini-
tial codes that lump too much and miss important distinctions, in which
case you will split the initial codes and define new, more fine-grained ones.
Expect to spend hours and weeks; shortchanging the development of the
measuring instruments on which a research project rests can be perilous.
Figure 2.1. Three coding schemes; each consists of a set of mutually exclusive and
exhaustive codes (see text for citations).
when caregiver and infant are both engaged with the same object; but
Supported is coded when the infant shows no awareness of the caregiver’s
engagement, whereas Coordinated is coded when the infant does show such
awareness, often with glances to the caregiver.
Each of these three coding schemes consists of a set of mutually exclu-
sive and exhaustive (ME&E) codes; that is, for every entity coded there
is one code in the set that applies (exhaustive), but only one (mutually
exclusive). These are desirable and easily achieved properties of coding
schemes. Organizing codes into ME&E sets often helps clarify our codes
when we are first defining and developing them and almost always simpli-
fies and facilitates subsequent recording and analysis of those codes. But
often research questions concern co-occurrence. So what should we do
with codes that can co-occur, like mother holding infant and mother look-
ing at infant?
Co-occurrence can be addressed and mutual exclusivity of codes within
sets achieved in one of two ways. First, when sets of codes are not mutu-
ally exclusive by definition, any set of codes can be made mutually exclu-
sive by defining new codes as combinations of existing codes. For example,
imagine that two codes are mother looks at infant and infant looks at mother.
Their co-occurrence could be defined as a third code, mutual gaze. Second,
codes that can co-occur can be assigned to different sets of codes, each of
which is mutually exclusive in itself. As a general rule, the coding schemes
defined for a given research project work best€– from coding, to recording,
to analysis€– when they consist of sets of coding schemes: Each set describes
an aspect of interest (e.g., mother’s gaze behavior, infant’s motor behavior).
For this example, two sets of codes could be defined. The first concerns
mother’s gaze and includes the code mother looks at infant, whereas the
second concerns infant’s gaze and includes the code infant looks at mother.
Mutual gaze, instead of being an explicit code, could then be determined
later analytically.
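As an illustration of what determining mutual gaze analytically can mean,
here is a minimal Python sketch with invented second-by-second gaze codes;
mutual gaze is computed from the two separately coded streams rather than
coded directly:

    # One gaze code per second for each partner; data are invented.
    mother = ["Look", "Look", "Away", "Look", "Away", "Look"]
    infant = ["Away", "Look", "Look", "Look", "Away", "Look"]

    # Mutual gaze is derived, not coded: both streams show Look at once.
    mutual = [m == "Look" and i == "Look" for m, i in zip(mother, infant)]
    print(sum(mutual), "seconds of mutual gaze")  # 3 seconds of mutual gaze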
Figure 2.2. A coding scheme consisting of two sets of mutually exclusive and
exhaustive codes (Cote et al., 2008).
bringing cup to doll’s mouth) was assigned the interval. Thus the hierarch-
ical rule makes the codes mutually exclusive.
If codes are as simple as mother and infant looking at each other, a four-
code scheme with a combination and a nil code might be fine; but when
more looking categories are considered, two separate ME&E schemes prob-
ably make more sense. Marc Bornstein’s work provides an example. He and
his colleagues were interested in documenting cultural variations in person-
and object-directed interaction (e.g., Cote, Bornstein, Haynes, & Bakeman,
2008). Observers coded continuously from video recordings using the two
ME&E sets of codes shown in Figure 2.2 and a computer-assisted system
that recorded times. Indices of contingency were then computed using the
computer program and procedures described later (see “Indices of associ-
ation for two-dimensional tables” in Chapter 9). Analysis of these indices
led Cote and colleagues to conclude that mothers were significantly more
responsive than infants to their partner’s person-directed behavior in each
of the three cultural groups studied, but that European American mothers
were significantly more responsive to their infants’ person-directed behav-
ior than Latina immigrant mothers, while neither group differed from non-
immigrant Latina mothers.
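The indices themselves are discussed in Chapter 9, but the underlying tallies
are simple. As a sketch – with invented counts, not Cote and colleagues'
data – one common index of contingency for a 2 × 2 tally is the odds ratio:

    # Invented 2 x 2 tally: did a maternal person-directed response follow
    # within some window, given infant person-directed behavior or not?
    a, b = 40, 10  # infant behavior: response followed / did not
    c, d = 20, 30  # no infant behavior: response followed / did not

    odds_ratio = (a * d) / (b * c)  # values above 1 suggest contingency
    print(odds_ratio)               # 6.0 for these made-up counts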
Should all your codes be segregated into ME&E sets, each represent-
ing some coherent dimension important for your research as illustrated in
Figure€2.2? The answer is: Not always. Imagine, for example, that your codes
identify five types of events that are central to your research questions and
that any and all can co-occur. Should you define five sets, each with two
codes – the behavior of interest and its absence? Or would it simply be better
to list the five codes and ask observers to note their occurrence? If you wanted
duration information, you could record onset and offset times for each.
Either strategy offers the same analytic options; thus, which you choose is a
matter of personal preference. As with the fewer-versus-more combination
codes question in an earlier paragraph, a good rule is to choose whichever
Figure 2.3. Codes for chimpanzee mother and infant food transfer (for definitions
see Ueno & Matsuzawa, 2004).
strategy your observers find easier to work with (and can do reliably). Thus
a brief answer to the question posed by this section (Must codes be mutually
exclusive and exhaustive?) might be: Often yes, but not necessarily.
Even when codes are mutually exclusive, breaking them into smaller sets
can simplify coding. For example, when coding similar, mutually exclusive
actions by different actors, we could define a separate code for each actor-
action combination (MomLooks, KidLooks, SibLooks, etc.), but it is simpler
to define two sets of codes, one for actors (Mom, Kid, Sib, etc.) and another
for actions (Looks, Talks, etc.). This results in fewer, less redundant codes
overall. And even when different actors perform different actions, organ-
izing them into separate sets still has the advantage of focusing observer
attention first on one actor and then on the other. The codes in the separ-
ate sets could be, but do not need to be, ME&E. An example is provided in
Ueno and Matsuzawa’s (2004) study of food transfer between chimpanzee
mothers and infants (see Figure 2.3). True, several of the thirteen infant
codes might co-occur never or rarely, but they are not designed to be mutu-
ally exclusive (although they could be with a hierarchical rule); likewise
with the eight mother codes. Here is an example where the answer to the
question of whether codes must be ME&E is: Not necessarily.
broad strokes. They can vary from micro to macro (or molecular to molar) –
from detailed and fine-grained to relatively broad and coarse-grained. The
appropriate level of granularity for a research project depends on the ques-
tions addressed. For example, if you are more interested in moment-to-
moment changes in expressed emotion than in global emotional state, you
might opt to use a fine-grained scheme like the facial action coding scheme
developed by Paul Ekman (Ekman & Friesen, 1978), which relates different
facial movements to their underlying muscles.
One useful guideline is that, when in doubt, you should define codes at a
somewhat finer level of granularity than your research questions require (i.e.,
when in doubt, split, do not lump). You can always analytically lump later,
but to state the obvious, you cannot recover distinctions never made (Suomi,
1979, pp. 121–122). Another useful guideline is that your codes’ granularity
should be roughly similar. Usually questions that guide a research project
are of a piece and require either more molar or more molecular codes. Mix
levels only if your research questions clearly require it.
Figure 2.4. Examples of coding schemes, one more physically based (infant; Oller,
2000) and one more socially based (maternal; Gros-Louis et al., 2006).
code might be infant crying; another example of a more socially based code
might be child engaged in cooperative play. Some ethologists and behavior-
ists might regard the former as objective and the latter subjective (and so
less scientific), but the physically based–socially based distinction may mat-
ter primarily when selecting and training observers. Are observers detec-
tors of things “really” there? Or are they more like cultural informants,
able through experience to “see” the distinctions embodied in our coding
schemes? Perhaps the most important consideration is whether observers
can be trained to apply coding schemes reliably (see Chapters 5 and 6) – no
matter how concrete the coding scheme.
Ekman’s facial action coding system (Ekman & Friesen, 1978), cited in
the previous section as an example of a molecular approach, also provides
an excellent example of a concrete coding scheme. Yet another example of a
concrete coding scheme is provided by Oller (2000), who categorizes young
infants’ vocalization as shown in Figure 2.4 (left). Oller and his colleagues
provide precise, acoustically based definitions that distinguish between, for
example, quasi-resonant vowel-like vocalizations, fully resonant vowels,
marginal syllables, and canonical syllables; and like Belsky and Most’s
(1981) infant object play codes, Oller’s first five codes describe a develop-
mental progression.
Both Ekman’s and Oller’s coding schemes are sufficiently concrete that
we can imagine their coding might be automated. Computer coding – dis-
pensing with human observers – has tantalized investigators for some time,
but remains mainly out of reach. True, computer scientists are attempting
to automate the process – and some success has been achieved with auto-
matic computer detection of Ekman-like facial action patterns (Cohn &
Kanade, 2007; Cohn & Sayette, 2010; Messinger, Mahoor, Chow, & Cohn,
2009) – but it still seems that as codes become more socially based, any
kind of computer automation becomes more elusive. As an example,
consider the coding scheme in Figure 2.4 (right) used to code maternal
are more tied to visible behavior. True, the items rated could be quite con-
crete, but given the nature of rating scales, they are more likely to be socially
based – for example, asking an observer to rate how happy a child was, 1
to 7, during a 1-minute period of interaction instead of coding how often a
child smiled or laughed during the same interval. Second, rating can be less
time-consuming than coding. Comparable codes and ratings may remain
at the same level of granularity, but the entities rated (e.g., 1-minute inter-
vals or whole session) can be longer than the entity coded (events) – which
requires fewer judgments from the observers and hence less time.
An example is provided by the work of Adamson and her colleagues.
First, observers coded video records of 108 children during structured-play
sessions (Adamson et al., 2004). The coding scheme was an extension of
the engagement scheme shown in Figure 2.1 that – as appropriate for chil-
dren learning language – included a symbol-infused joint engagement code
(joint engagement that incorporates use of symbols such as words). Second,
observers rated 1 to 7 the amount of total joint engagement and the amount
of supported, coordinated, and symbol-infused joint engagement in the six
5-minute segments that constituted the session. Mean ratings correlated
strongly with percentages derived from the coded data (.75, .67, .86, and
.89 for total, supported, coordinated, and symbol-infused joint engagement,
respectively), but rating took considerably less time than coding. Their
comparison demonstrates that when process questions are not at issue and
summary measures of outcome are sufficient to address research questions,
rating – instead of coding – the same behavior can provide the same quality
results with much less effort and for that reason is well worth considering.
with more elaborate and extended definitions that stress similarities and
differences between the particular code and other codes with which it could
be confused. Examples of the behavior to which the codes apply are help-
ful and might consist of verbal descriptions, pictures or other graphic aids,
sound or video recordings, or some combination of these.
The coding manual also explains any special coding rules (e.g., only
engagement states that last at least 3 seconds are to be coded). Like devel-
oping a coding scheme, drafting a coding manual is an iterative process.
Ideally the two processes occur in tandem; the coding manual should be
drafted as the coding schemes are evolving. Once completed – and with
the understanding that additional refinements remain a possibility – the
coding manual stands as a reference for training new coders and an anchor
against observer drift (i.e., any change over time in observers’ implicit defi-
nitions of codes). It also documents procedures in ways that can be shared
with other researchers. For all these reasons the coding manual is central
and essential to observational research and deserves considerable care and
attention. Published reports should routinely note that copies are avail-
able on request. (For further comments concerning coding manuals, see
Yoder & Symons, 2010, chapter 3.)
Research articles often abstract the coding manual, in whole or in part –
which is helpful, if not essential, for readers. For example, Dian Fossey
(1972) provided extensive definitions for nine types of mountain gorilla
vocalizations. Three of her definitions are given in Figure 2.5 and show the
sort of detailed definition desirable in a coding manual.
Codes – that is, the names or labels you select to identify your codes –
have many uses. They appear in coding manuals, may be entered on data-
recording forms or used by computer-assisted coding programs, and appear
again as variable names in spreadsheet programs and statistical packages.
It is worth forming them with care. They should be consistent; for example,
do not use jt as an abbreviation for joint in one code and jnt in another. As
a general rule, briefer, mnemonic names are better; codes longer than eight
or so characters can clutter the screen or page. Not all programs treat codes
as case-sensitive, but using upper- and lowercase letters is often helpful.
Underscores distract the eye and are best avoided (they are a holdover from
a time when many computer programs did not allow embedded blanks
in, e.g., file names). Thus MomLook is a better choice than Mom_Look;
both use upper- and lowercase letters, but the first is shorter and avoids an
underscore. Names are a matter of taste, so some variability is expected.
Nonetheless, a good test is whether codes are readily understood by others
at first glance or seem dense and idiosyncratic.
summary
Coding schemes are the primary measuring instrument of observational
methods. Like the lens in telescopes and microscopes, they both limit and
focus observers’ attention. Coding schemes consist of lists of names or cat-
egories (or, less often, ratings) that observers then assign to the observed
behavior. Often codes are organized into mutually exclusive (only one code
applies to each entity coded) and exhaustive (some code applies to every
entity coded) sets (ME&E) – a practice we recommend because of the clar-
ity it provides.
Coding schemes can be adapted from others with similar theoretical con-
cerns and assumptions or developed de novo; in either case, development
is necessarily an iterative process and well worth the time it takes. Coding
schemes necessarily reflect underlying theoretical assumptions and clarity
results when links between theory and codes are made explicit. Several
examples of coding schemes have been given throughout this chapter. They
varied in degree of granularity from fairly molecular to quite molar, from
finer-grained to coarser-grained. They also varied in degree of concreteness
from clearly physically based to socially based codes for which observers
can be regarded more as social informants than simple detectors.
The coding manual is the end product of coding scheme development
and an essential component of observational research. It provides defini-
tions and examples for codes, details how coding schemes are organized,
and explains various coding rules. It documents procedures, is essential for
observer training, and can be shared with other researchers who want to
replicate or adapt similar procedures.
3
Recording Observational Data
Figure 3.1. A decision tree for the four recording strategies: for a behavioral
event, if duration is recorded, the strategy is timed-event recording, and if not,
untimed-event recording; for time intervals, if they are contiguous, the strategy
is interval recording, and if not, selected-interval recording.
recording when they are not. These four strategies (see Figure 3.1) are dis-
cussed in the next four sections, but our primary focus is on the first three.
In the remainder of the book, we explore the first three (untimed-event,
timed-event, and interval recording) further because, granted assumptions,
they produce data appropriate for sequential analysis.
Classifications are not so much perfect as useful, and we find that this
division of recording strategies into four kinds generally seems to describe
what investigators do. It also agrees reasonably with other authors, who
may use somewhat different terms but generally make the same distinc-
tions. For example, Martin and Bateson (2007) define two basic types of
recording rules: continuous recording and time sampling. Their continuous
recording – like our timed-event recording – “aims to provide an exact
and faithful record of the behavior, measuring true frequencies and dura-
tions”; and for their time sampling (a term with a long history in the field, as
detailed in the “Interval recording” section later in this chapter) – as for our
interval recording – “the observation session is divided up into successive,
short periods of time” (p. 51). Similarly, Yoder and Symons (2010) define
three kinds of behavior sampling: continuous behavior sampling, intermit-
tent behavior sampling, and interval sampling. Their continuous behavior
sampling – like our event recording – is divided into two types: untimed-
event and timed-event. Their interval sampling is the same as our interval
recording. And their intermittent sampling is similar to our selected-inter-
val recording.
Another example where authors make similar distinctions but use differ-
ent terms was given in Chapter 1. We use the term event generally, both for
behaviors that are relatively instantaneous and those that have appreciable
duration – but then use the term momentary event to identify relatively
brief behaviors for which duration is not of interest (see “Timed-event and
state sequences” in Chapter 4), whereas Altmann (1974) reserved the term
event for what we call momentary events and the term state for behaviors
that have appreciable duration.
untimed-event recording
Detecting events as they occur in the stream of behavior and coding one or
more of their aspects, but not recording their duration – which is what we
usually mean by untimed-event recording – seems deceptively simple, but
still places demands on the observer. In such cases, observers are asked not
just to code the events, but to detect them – which requires that they be con-
tinuously alert, note when a new event first begins and then ends, and only
then assign a code (or codes) to it. If the events to be coded were demar-
cated beforehand (e.g., as utterances or turns-of-talk in a transcript), the
observer’s task would be less demanding, but such cases are relatively rare.
One merit of untimed-event recording is how well it works with mate-
rials no more complex than paper and pencil. A lined paper tablet, with
columns representing codes and rows successive events, can be helpful;
see Figure 3.2 for an example of what an untimed-event recording form
could look like. When recording untimed events, codes are necessarily
organized as one or more ME&E sets. The sample form shows two sets:
the first codes statement type as command, declaration, or question, and
the second codes whether the statement was accompanied by a look, a ges-
ture, both, or neither. Each successive event is checked for one of the state-
ment types and for one of the look-gesture codes. For this example, events
are cross-classified by statement and gesture-look and their counts could
be summarized with a contingency table. With only one ME&E set, the
data would be represented with a single string of codes – a format that is
both common in the literature and that has attracted considerable analytic
attention as we will see in subsequent chapters. For example, the sequence
for the type of statement codes in Figure 3.2 would be: declr, declr, quest,
comnd, comnd, and quest.
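Even so simple a string supports useful tallies. A minimal Python sketch
(ours, not a feature of any particular program) counts code frequencies,
proportions, and lag-1 transitions for the sequence just given:

    from collections import Counter

    seq = ["declr", "declr", "quest", "comnd", "comnd", "quest"]

    freq = Counter(seq)                       # times each code was assigned
    props = {c: n / len(seq) for c, n in freq.items()}  # proportion of events
    lag1 = Counter(zip(seq, seq[1:]))         # how the codes were sequenced

    print(freq)   # Counter({'declr': 2, 'quest': 2, 'comnd': 2})
    print(lag1)   # e.g., ('declr', 'quest'): 1, ('comnd', 'comnd'): 1, ...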
Untimed-event recording is simple and inexpensive to implement. The
disadvantage is that knowing only the sequence of events, but not their dur-
ation, limits the kinds of information that can be derived from the data.
You can report how many times each code was assigned, the proportion of
events for each code, how the codes were sequenced, and how many times
codes from different ME&E sets co-occurred if more than one ME&E set
was used, but you cannot report much more – although you could report
rates (i.e., how often each code was used per unit of time) if you know the
duration of the session (either because you arbitrarily limited it to a fixed
length or recorded its start and stop times). However, if your research
questions require nothing more than information about frequency and
sequence – and possibly cross-classification – then the simplicity and low
cost of untimed-event recording may be sufficient.
Figure 3.2. A paper form for untimed-event recording with two sets of ME&E
codes.
timed-event recording
Often duration matters. You may want to know, for example, not just how
often the mother soothed her infant (i.e., how many times), but the percent-
age of time she spent soothing during the session, or even the percentage
of time her infant cried during the time she spent soothing. In such cases,
you need to record not just that events occurred, but how long they lasted.
In general, this is the approach we recommend because of the richness of
the data collected and the analytic options such data afford. As we noted
earlier in this chapter, Martin and Bateson (2007, p. 51) wrote that timed-
event recording provides “an exact and faithful record” of behavior. Not sur-
prisingly, increased options have their price. Recording duration, usually by
noting the onset and offset times for events, increases the burden on obser-
vers, requires more expensive and complex recording equipment, or both –
and typically there is a trade-off involved. True, the burden on observers
decreases substantially with computer-assisted recording systems, but such
systems take more to acquire and maintain than paper and pencil.
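A payoff of recording onset and offset times is that such duration statistics
fall out directly. Here is a minimal Python sketch with invented times, in
seconds:

    soothe = [(10, 40), (60, 90)]  # (onset, offset) pairs: mother soothes
    cry = [(5, 20), (70, 80)]      # (onset, offset) pairs: infant cries
    session = 120                  # session length in seconds

    def total(spans):
        return sum(off - on for on, off in spans)

    def overlap(xs, ys):
        """Seconds during which a span from xs co-occurs with one from ys."""
        return sum(max(0, min(x_off, y_off) - max(x_on, y_on))
                   for x_on, x_off in xs for y_on, y_off in ys)

    print(100 * total(soothe) / session)            # 50.0: % of session soothing
    print(100 * overlap(cry, soothe) / total(cry))  # 80.0: % of crying while soothed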
Timed-event recording does not absolutely require computer technology,
but works best with it. Nor does it absolutely require that observers work with
Figure 3.3. A paper form for timed-event recording, with columns for event
number, onset time, offset time, Mom code, Kid code, and comment.
interval recording
Like untimed-event recording, interval recording is relatively easy and inex-
pensive to implement and, perhaps for this reason, has been much used in the
past. At the same time, it is associated with a more complex terminology and
series of choices than either untimed-event or timed-event recording and,
beginning in the 1920s, has spawned a rather large, sometimes arcane litera-
ture that seems increasingly dated. Typically it is referred to as time sampling
(e.g., Arrington, 1943; Hutt & Hutt, 1970). It has often seemed something
of a compromise; it is easier to implement but less exact than timed-event
recording. As Martin and Bateson (2007) wrote, “Less information is pre-
served and an exact record of behavior is not necessarily obtained” (p. 51).
And even though 34 percent of all articles published in Child Development in
the 1980s used interval recording (Mann, Have, Plunkett, & Meisels, 1991;
cited in Yoder & Symons, 2010), it is our belief that if timed-event recording
had been easily available, many of those investigators would have preferred
it (certainly this was true for Bakeman & Brownlee, 1980).
The essence of interval recording is this: The stream of behavior is seg-
mented into relatively brief, fixed time intervals (e.g., 10 or 15 seconds in
duration), and one or more codes are assigned to each successive inter-
val. Unlike untimed-event and timed-event recording, which require that
observers pay attention to events and their boundaries, interval recording
requires that observers pay attention to time boundaries, which could be
noted with simple clicks (like a metronome, only slower). Konner’s (1976)
study of the !Kung cited in Chapter 1 is an example. Recall that an electronic
device delivered a click to his ear every 15 seconds and that he then checked
which of several mother, infant, adult, and child behaviors had occurred
since the last click. As this example suggests, interval recording can be
effected with simple and inexpensive materials. Some timing device that
demarcates intervals is needed. Otherwise, paper and pencil suffice. Lined
paper tablets are helpful – each row can represent a successive time inter-
val and each column a specific code. Figure 3.4 gives an example of what
an interval recording form might look like. The codes used as examples
here are selected from a re-analysis of Konner’s data (Bakeman, Adamson,
Konner, & Barr, 1990) and are defined as infant manipulates object, explores
object, relates two objects, and vocalizes; and mother vocalizes to, encourages,
and entertains her infant.
However, interval recording (or time sampling, which is the more com-
mon term in the literature) is more complex than the example just pre-
sented suggests. Three kinds of sampling strategies are usually identified:
partial-interval (or one-zero), momentary (or instantaneous), and whole-
interval sampling (Powell, Martindale, & Kulp, 1975; Quera, 1990; Suen &
Ary, 1989; Yoder & Symons, 2010). The recording rules for each are some-
what different, as detailed in subsequent paragraphs.
Figure 3.4. A paper form for interval recording, with a row per interval and
columns for infant and mother codes.
Whole-Interval Sampling
Whole-interval sampling is probably the least common of the three sam-
pling strategies. Its recording rule is: Check an interval only if the behavior
occurred for the duration of that interval; do not check if the behavior did
not occur or occurred for only part of the interval. Like event recording, it
requires that observers be continuously alert. A variant of whole-interval
sampling is: Check the behavior that predominated during the inter-
val (called predominant activity sampling by Hutt & Hutt, 1970) – which
seems similar to whole-interval sampling but gives approximations simi-
lar to those of momentary sampling (Tyler, 1979). Momentary and whole-
interval sampling are alike in that intervals are checked for one, and only
one, code (codes are regarded as mutually exclusive by definition), whereas
with one-zero sampling, intervals may be – and often are – checked for
more than one code.
As noted earlier, the advantages of interval recording are primarily
practical; it is easy and inexpensive to implement. The disadvantage is that
summary statistics may be estimated only approximately. For example,
with partial-interval sampling, frequencies are likely underestimated (a
check can indicate more than one occurrence in an interval), proportions
are likely overestimated (a check does not mean the event occupied the
entire interval), and sequences can be muddled (if more than one code is
checked for an interval, which occurred first?). Moreover, with moment-
ary or whole-interval sampling, two successive checks could indicate either
one continuing or two separate occurrences of the behavior. There are pos-
sible fixes to these problems, but none seem completely satisfactory (see
Altmann & Wagner, 1970; Quera, 1990; Sackett, 1978).
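A small simulation makes these biases concrete. In this sketch (invented
data), a 15-second record is summarized under each rule using 5-second
intervals; one-zero sampling overestimates the true proportion, and whole-
interval sampling underestimates it:

    track = [True] * 3 + [False] * 7 + [True] * 5   # second-by-second record
    width = 5
    ivs = [track[i:i + width] for i in range(0, len(track), width)]

    one_zero = [any(iv) for iv in ivs]    # partial-interval: any occurrence
    momentary = [iv[0] for iv in ivs]     # momentary: status at the click
    whole = [all(iv) for iv in ivs]       # whole-interval: fills the interval

    print(sum(track) / len(track))        # true proportion: 8/15 = .53
    print(sum(one_zero) / len(ivs))       # .67 -- overestimates
    print(sum(momentary) / len(ivs))      # .67 for this one sample
    print(sum(whole) / len(ivs))          # .33 -- underestimates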
Interval duration is a key parameter of interval recording. When the
interval duration is small relative to event durations and the gaps between
events, estimates of code frequencies, durations, and sequences will be bet-
ter and more precise (Suen & Ary, 1989). Decreasing the duration of the
intervals used for interval recording, however, increases the number of deci-
sions observers need to make and thereby loses the simplifying advantage
of interval recording. To take matters to the limit, if the interval duration
is decreased to the precision with which time is recorded (e.g., if 1-second
intervals are defined), interval-recorded data become indistinguishable
from timed-event-recorded data – their code-unit grid becomes the same
(see “A universal code-unit grid” in Chapter 4). Better to record timed-event
data in the first place than use interval recording with intervals that are too small.
In sum, we recommend interval recording only if approximate estimates
are sufficient to answer your research questions and the practical advan-
tages are decisive (e.g., you have limited resources and the cost of paper and
pencil is attractive). An additional advantage of interval recording is that it
fits standard statistical models for observer agreement statistics better than
other methods, as we discuss in Chapter 6 – but this alone is not a good
reason for selecting this method.
selected-interval recording
The previous section has described methods for assigning codes to contigu-
ous fixed intervals – an observational session was segmented into intervals
of a specified duration, usually relatively brief, and the intervals were then
assigned codes per the recording rules for partial, momentary, or whole
intervals. With such methods, continuity was usually assumed, the data
were regarded as sequential, and sequential data analytic techniques such
as those described in Chapter 11 could be applied.
In contrast to interval recording, what we call selected-interval recording
methods code noncontiguous intervals; summary statistics such as those
we describe in Chapter 8 can still be computed and, when research ques-
tions do not concern contingency or require sequential data, these record-
ing methods can be useful. In fact, selected-interval recording is something
of a residual category. We mean to include in it any methods that assign
codes to noncontiguous intervals. However, we recognize that when those
intervals are equally spaced (every hour, every day, every month, etc.),
momentary sampling is an equally appropriate term; and when every n-th
interval is coded per partial- or whole-interval rules, the method remains
interval recording (which is equivalent to separating observation inter-
vals with recording intervals; see Bass & Aserlind, 1984; Rojahn & Kanoy,
1985). It is also a rather heterogeneous category; thus, instead of attempting
to exhaustively describe all the many variants in the literature, we will sim-
ply give a few examples.
Generally, whenever the intent is to describe how individual animals or
humans distribute their time among different types of activities (time-bud-
get information), selected-interval recording can be an efficient approach.
For example, both Parten (1932) and Smith (1978), cited in Chapter 1,
coded isolated, noncontiguous selected intervals. Florence Goodenough
(1928) called Parten’s the method of repeated short samples. Arrington
(1943) defined time sampling as “a method of observing the behavior of
individuals or groups under ordinary conditions of everyday life in which
observations are made in a series of short periods so distributed as to afford
representative sampling of the behavior under observation” (p. 82). She
credited Olson (1929, cited in Arrington) with its first use (his observers
made a check when specified behaviors for grade-school students occurred
during a 5-minute interval – thus one-zero sampling) and wrote that
The result is a data file in which each line represents a coded event along
with its onset and (optionally) offset time. Software programs vary in their
conventions and capabilities, but when sets of ME&E codes are defined,
many such programs automatically supply the offset time for a code when
the onset time of another code in the same set is recorded. Alternatively, off-
set times can be entered explicitly. Another useful feature, present in most
coding software, lets you apply additional codes to events already coded. For
example, after coding an event MomSpeaks, you might then want to code
its tone as Positive or Negative, note its function (e.g., Question, Demand,
Comment), and so forth.
Some software programs permit what we call post-hoc coding – in other
words, they allow you to first detect an event and only code it afterward, once
the whole event has transpired. Compared to systems that require you to
provide a code at the onset of an event, post-hoc coding can minimize back-
and-forth playback and so speed up coding considerably. For example (with
appropriate options set), when you think an event is beginning, you would
hold down the space bar; and when it ends, you would release it, which will
pause playback. You can then decide what the code should be and enter it
with a keystroke or a mouse click. At that point, you can restart playback
and wait for the next event to begin. Alternatively, if you are segmenting the
stream of behavior with a single set of ME&E codes (e.g., Wolff’s, 1966, infant
state or Adamson and Bakeman’s, 1984, engagement state codes, as cited in
Chapter 1), you would simply restart play by depressing the space bar after
entering a code. When that state ends, release the space bar, enter the appro-
priate code, and continue. You can always back up and replay events and edit
both times and codes, of course, but post-hoc coding offers a quite natural
and relatively quick way to segment a session into ME&E states.
Another sophisticated coding possibility is what we call branched-chain cod-
ing (called lexical chaining by INTERACT); it is useful if you wish to
assign multiple codes to an event. For example, Bakeman and Brownlee
(1982) asked coders to detect possession struggles – that is, times when one
preschool child (the holder) possessed an object and another (the taker)
attempted to take it away. With appropriate software, coding could proceed
as follows: Once a possession episode is identified, coders are asked (via an
on-screen window) to select whether the taker had prior possession (had
played with the object during the previous minute, yes or no). A second
window asks whether the holder offered resistance (yes or no), and a third
whether the taker gained possession of the contested object (yes or no).
Thus coders are presented successive sets of codes; after selecting a code
from one set, they are presented with codes from the next set. The present
example used three sets of two codes each, but you could use as many sets
with as many codes as needed – which makes this a very flexible approach.
It also illustrates how appropriate software can manage clerical coding
details while letting observers focus solely on the task of coding.
The appeal of branched-chain coding is that observers need to focus on
only one decision (i.e., one set of codes) at a time, recording each decision
with a keystroke or mouse click. Often the set presented next is the same, no
matter the code just selected (as when cross-classifying an event on several
dimensions). However, the set of codes presented next can be determined
by the code just selected (as when coding decisions are structured hierarch-
ically in a tree diagram). For example, imagine that observers are first asked
to detect communicative acts and code them as involving speech, gesture,
or gesture-speech combination (based on Özçalışkan & Goldin-Meadow,
2009). If either gesture or gesture-speech is coded, next observers code the
type of gesture (conventional, deictic, iconic). And if gesture-speech is coded,
observers could also be asked to code the type of information conveyed by
the combination (reinforcing, disambiguating, supplementary). Again, the
ability to chain codes in this way offers impressive flexibility.
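Conceptually, branched-chain coding is a walk down a decision tree. This
sketch (ours; actual coding programs present on-screen windows rather
than console prompts) encodes the gesture example as nested sets of codes,
where an empty set ends the chain:

    tree = {
        "speech": {},
        "gesture": {"conventional": {}, "deictic": {}, "iconic": {}},
        "gesture-speech": {
            g: {"reinforcing": {}, "disambiguating": {}, "supplementary": {}}
            for g in ("conventional", "deictic", "iconic")
        },
    }

    def code_event(node=tree, chosen=()):
        """Present one set of codes at a time; return the chain selected."""
        while node:
            pick = input(f"code {sorted(node)}: ")  # one decision per prompt
            node, chosen = node[pick], chosen + (pick,)
        return chosen  # e.g., ("gesture-speech", "deictic", "supplementary")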
Finally, with most software for computer-assisted coding (and video-
editing software generally), you can assemble lists of particular episodes
that can then be played sequentially, ignoring other material. Such capabil-
ities are not only useful for coding and training, but for educational and
presentation purposes as well. Still, we do not think that investigators who
require continuous timed-event recording need despair if their resources
are limited. Digital files can be played with standard and free software on
standard computers, or videotapes can be played on the usual tape playback
devices. Observers can write codes and the times they occur on a paper form
and enter the information into a computer file later, or they can enter the
information directly as they code (e.g., using a spreadsheet program run-
ning on the same or a separate computer). Times can even be written when
coding live using only paper, pencil, and a clock. Such low-tech approaches
can be tedious and error-prone – and affordable – but when used well, they
can produce timed-event data that are indistinguishable from those collected
with systems costing far more. More time and effort may be required, but
the end result can be the same.
summary
When setting out to record observational data, you need to select not just
appropriate materials – which can range from simple paper and pencil to
4
Representing Observational Data
Once observers have done their work€ – that is, once their assignment of
codes to events or intervals has been committed to paper or electronic files€–
it is tempting to think that you can now move directly to analysis of the
coded data. Almost always this is premature because it bypasses two import-
ant intervening steps. The second step involves reducing sequential data for
a session into summary scores for subsequent analysis and is relatively well
understood; for details see Chapters 8 and 9. The first step is equally import-
ant but often receives less attention. It is the subject of this chapter and
involves representing – literally, re-presenting – the data as recorded initially
into formats that facilitate subsequent data reduction and analysis.
When recording observational data, as described in the preceding chap-
ter, observer ease and accuracy are paramount, and methods and formats
for recording data appropriately accommodate these concerns. But when
progressing to data preparation, reduction, and analysis, different formats
may work better. In this chapter, we consider two levels of data represen-
tation. The first is a standard format – that is, a set of conventions – for
sequential data that defines five basic data types and reflects the recording
strategies described in the previous chapter. The second is more conceptual;
it is a way of thinking about sequential data in terms of a universal code-
unit grid that applies to all data types and that facilitates data analysis and
data modification, as demonstrated in subsequent chapters (especially in
Chapter 10).
Figure 4.1. Recording strategies, data types, and coding and universal grid units;
see text for definitions.
representing time
If duration matters, even if only for an observation session and not the
events within it, then time must be recorded whether you use SDIS or some
other set of conventions to represent your data. And exactly how time is
represented is not always a simple matter. It can be more complicated than
simply using integers to count, for example, the number of camels or using
real numbers to gauge the weight of a camel in pounds represented with a
fractional component.
Our conventional way of representing time (60 seconds to a minute, 60
minutes to an hour, 24 hours to a day) goes back to Babylon and before.
Contemporary digital timekeeping devices represent fractional parts of a
second with digits after the decimal point. Visual recordings provide a new
wrinkle – moving images are represented by a series of still frames, with the
number of frames per second varying according to the standards used by
the recording device. One common time format used for visual recordings
is hh:mm:ss:ff, where hh is the number of hours (0–23), mm the number
of minutes (0–59), ss the number of seconds (0–59), and ff the number of
frames (0–29 for the NTSC standard used in the United States and 0–24 for
the PAL standard used in much of Europe), although exactly what a frame
means becomes less clear for digital recording.
For historical reasons, hh:mm:ss, mm:ss, and ss are all reasonable ways
to represent time, and in fact most computer systems accommodate any
of these formats. Given different standards for number of frames per
second, it makes sense to convert frames to fractional parts of a second,
thus replacing hh:mm:ss:ff with hh:mm:ss.d… as a standard format for time,
where d is a digit indicating a fractional part of a second. Then the question
becomes how many digits to use after the decimal. For most observational
coding, we would argue for no more than two – unless specialized equip-
ment is used that records many more frames per second than is standard.
When recording live, human reaction time typically averages 0.3 second,
and the precision of video recording is limited by the number of frames per
second – approximately 0.033 and 0.040 second per frame for the NTSC
and PAL standards, respectively. Given these considerations, claiming a
tenth of a second accuracy seems reasonable, a hundredth of a second accu-
racy dubious, and any greater accuracy futile.
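For readers who want the arithmetic, a minimal sketch of the frame-to-fraction conversion just described, assuming an hh:mm:ss:ff timecode and a nominal whole-number frame rate (the function name and one-digit rounding are our choices):

def timecode_to_seconds(tc, fps):
    """Convert hh:mm:ss:ff to seconds, replacing frames with a
    fractional part of a second; rounded to 0.1 s, consistent with
    the accuracy argument above. fps is nominal (30 NTSC, 25 PAL)."""
    hh, mm, ss, ff = (int(part) for part in tc.split(":"))
    return round(hh * 3600 + mm * 60 + ss + ff / fps, 1)

print(timecode_to_seconds("02:43:05:15", fps=30))  # NTSC -> 9785.5
print(timecode_to_seconds("02:43:05:15", fps=25))  # PAL  -> 9785.6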
However, for many behavioral research questions, accuracy to the near-
est second is sufficient, and for that reason we often recommend rounding
all times to the nearest second in the first place. Although some computer
programs may display multiple digits after the decimal point (three digits is
fairly common), there is no reason for you to take them seriously – unless,
as noted, you have specialized equipment and concerns (e.g., phonetic-level
coding of the acoustic signal). The SDIS compiler included in GSEQ does
allow the hh:mm:ss, mm:ss, and ss formats to be followed by a decimal point
with one, two, or three digits, but GSEQ also includes a utility for rounding
those times if you think less precision is more reasonable.
To avoid possible confusion, exclusive and inclusive offset times should
be distinguished. Time units are considered discrete by the SDIS compiler
and GSEQ. As a result€– but also as you would expect€– duration is com-
puted by subtracting an event’s onset time from its offset time. For example,
if the onset for KidTalks is 02:43:05 and its offset is 02:43:09, then – because
the offset time is assumed to be exclusive – its duration is 4 seconds. For
this example, the inclusive offset time would be 02:43:08. Unless explicitly
stated otherwise, it is usually safe to assume that offset times are exclusive.
If we always said 5 to 9 (exclusive) and 5 through 8 (inclusive), this might be
clear enough, but often to and through are used interchangeably in everyday
English, which loses the exclusive-inclusive distinction. Some languages,
like Spanish, lack the to-through distinction. The safest course is always to
say either inclusive or exclusive, whichever is appropriate.
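In code, the exclusive-offset convention amounts to a single subtraction; a small sketch (the parsing format and function names are ours):

def to_seconds(hms):
    """Parse hh:mm:ss into whole seconds (1-second precision)."""
    hh, mm, ss = (int(x) for x in hms.split(":"))
    return hh * 3600 + mm * 60 + ss

def duration(onset, offset):
    """Offset is exclusive, so duration is simply offset minus onset."""
    return to_seconds(offset) - to_seconds(onset)

# The KidTalks example: onset 02:43:05, exclusive offset 02:43:09
# (inclusive offset 02:43:08) -> duration 4 seconds.
print(duration("02:43:05", "02:43:09"))  # -> 4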
<Case #1>
declr
declr
quest
comnd
comnd
quest … /
<Case #2>
,02:57:12 quest declr declr comnd … ,03:02:28/
<Case #3>
…/
Figure 4.2. An example of an SDIS single-code event sequential data file; % indi-
cates a comment. Codes may be listed one (Case #1) or more (Case #2) per line, as
desired. Session start and stop times may be included (as for Case #2) but are not
required. See text for other details.
Any line that begins with a percent sign (%) is treated as a comment and otherwise
ignored; comments enclosed in percent signs may also appear anywhere in a
line (% is the default comment character; it can be changed).
The data for each session are terminated with a forward slash. The ses-
sion may begin with a session label enclosed in angle brackets (this is
optional). If interruptions occur during sessions, thus segmenting them,
segment boundaries are indicated with semicolons (and so interruptions
can be taken into account when computing sequential statistics). Explicit
session start and stop times are optional. If given, they consist of a comma
followed by the time (see Case #2 in Figure 4.2). If start and stop times are
given, then rates can be computed later (see Chapter 8). Case #2 also shows
several codes on a line, which is a format some users may prefer. Spaces, as
well as tabs and line breaks, separate different elements; otherwise, you may
enter spaces and line breaks to format the file as you wish. In single-code
event sequences, when all codes are a single character, they do not need to
be separated (e.g., ABC is the same as A B C), provided you have checked
the single-character SDIS compiler option. This option makes manual data
entry easier for single-code event sequences (and is also valid for the inter-
val and multicode event sequences described subsequently).
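To illustrate these conventions (%-comments, optional angle-bracket session labels, slash-terminated sessions, whitespace-separated codes), here is a toy reader for single-code event data. It is a sketch of the format only, not the actual SDIS compiler; it ignores optional session times, semicolon segment boundaries, and the single-character option:

import re

def read_single_code_events(text):
    """Split SDIS-like single-code event data into sessions: strip
    %-comments, split on '/' (session terminator), pull out an
    optional <label>, and treat whitespace runs as separators."""
    text = re.sub(r"%[^%\n]*%?", "", text)      # drop comments
    sessions = []
    for chunk in text.split("/")[:-1]:          # '/' ends each session
        label = None
        m = re.search(r"<([^>]*)>", chunk)
        if m:
            label = m.group(1).strip()
            chunk = chunk.replace(m.group(0), "")
        sessions.append((label, chunk.split()))
    return sessions

data = "<Case #1> declr declr quest comnd comnd quest /"
print(read_single_code_events(data))
# -> [('Case #1', ['declr', 'declr', 'quest', 'comnd', 'comnd', 'quest'])]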
[Grid: rows labeled by code; columns numbered 1–30 for successive time units.]
Figure 4.3. An example of an SDIS timed-event sequential data file, with data
shown in the grid at the top. Events may be listed one (Session #2) or more per line
(Sessions #1 and #3), as desired. See text for other details.
data. Session #2 also shows one event per line instead of several, which
some users may prefer€– but, as noted earlier, line breaks and spaces may be
entered where you wish. (Additional possibilities include what we call com-
bination codes and context codes; these potentially useful, but less frequently
used, options are described in the GSEQ help file.)
Session #3 shows how codes can be entered in more than one stream
and also shows the Code,OnsetTime format (omitting both offset time and
hyphen). The SDIS compiler expects codes within a session to be time-or-
dered, as in a single forward-flowing stream; that is, later onset times cannot
occur before earlier ones (this can be useful for finding data entry errors).
However, just as coders often find it convenient to make multiple passes
through a video record, entering data for each pass separately (e.g., coding
first mother and then child behavior), you may find it useful to enter data
in the SDIS file in more than one stream. SDIS conventions allow timed-
event data (and state data, described in the next paragraph) to be listed
<Session #1>
Quiet,1 MomTalks,2 KidTalks,5 Quiet,9 MomTalks,11
KidTalks,17 Quiet,19 MomTalks,24 Quiet,28 ,31/
<Session #2>
Quiet=1 MomTalks=3 KidTalks=4 Quiet=2 MomTalks=6
KidTalks=2 Quiet=5 MomTalks=4 Quiet=3/
<Session #3>
…/
Figure 4.4. An example of an SDIS state sequential data file for the data shown in
Figure 4.3. See text for details.
Figure 4.5. Examples of an SDIS interval sequential data file (based on Figure 3.4)
and an SDIS multicode event sequential data file (based on Figure 3.2). See text
for details.
by the code (or codes). For interval data, an empty interval is indicated by
two commas with nothing between them. The repetition asterisk can also
be used with empty intervals (e.g., 5 empty intervals would be indicated
as … 5* , …). Multicode data does not contain empty events, by definition.
Other conventions (e.g., for sessions and factors) have been described in
preceding paragraphs.
[Grid: rows labeled Alpha, Beta, …; columns are successive units.]
Figure 4.6. An example of a code-unit grid for which rows represent codes and
successive columns could represent either events, time units, or intervals.
Examples include Noldus Information Technology’s The Observer, but you
can easily assemble a more extended list with a few minutes of Internet search.
No matter what computerized systems you decide to use, once coding is
done, you are left with data files in a particular format. The question then
becomes, what is that format and what can you do with it? Many computer-
assisted coding programs provide summary statistics, and these may be suf-
ficient for your purpose. If not, or if you are not using such a program, you
may want to use GSEQ, which was designed not for initial data collection
but specifically for data analysis. GSEQ requires that observational data
be formatted per SDIS conventions, as described in earlier sections of this
chapter. Thus, unless a program you are using has the capability to write
SDIS-formatted files, you will need to convert the data files you have into
SDIS format. This is usually quite straightforward. Many programs prod-
uce files similar to that shown in Figure 3.3, listing codes along with their
onset and offset times. Such files can be converted into SDIS format using
search-and-replace and other editing capabilities of standard spreadsheet
and word-processing programs, or with custom conversion programs such
as those we have written (e.g., Bakeman & Quera, 2008).
summary
As observers code, they record their decisions, perhaps with marks on paper
that are transferred to electronic files later, although increasingly with key-
strokes that create electronic data files directly. When designing formats
for recording the observer’s initial decisions, ease and accuracy are import-
ant considerations. Once recorded, however, it makes sense to represent
(literally re-present) the recorded data in formats that facilitate subsequent
analysis.
We have defined a set of formatting conventions – the Sequential Data
Interchange Standard or SDIS format – that can be used to represent
observational data. These conventions, which were detailed in this chap-
ter, accommodate different recording strategies. The SDIS compiler trans-
forms SDIS-formatted data into a format based on a common code-unit
grid for subsequent analysis by computer programs such as the Generalized
Sequential Querier (GSEQ), a program we designed specifically for analysis
of sequential observational data.
Whatever system you use for recording coded data, you are left with data
files. If you enter data files yourself, perhaps from paper records, it is a sim-
ple matter to put them in SDIS format. If you use programs that produce
data files in their own format, it is usually a simple matter to convert such
5 Observer Agreement and Cohen’s Kappa
The usual assumption is that if point-by-point agreement for the data as
collected meets accepted standards, so too will summary scores derived
from them. As a result, additional statistical proof for the reliability of
summary scores is rarely requested or provided.
In fact, different types of errors matter for point-by-point and for sum-
mary agreement. For point-by-point agreement, errors are qualitative and
may consist of disagreements – an observer applies a different code from
the other observer or the gold standard; omissions – an observer fails to
detect an event that the other observer or the gold standard does; and
commissions – an observer detects an event that the other observer or the
gold standard does not (see “Errors of commission and omission” later in
this chapter). In contrast, for summary agreement, errors are quantitative
and occur when summary statistics are not the same. For point-by-point
agreement, the greater the qualitative error, the lower the kappa. For sum-
mary agreement, the greater the quantitative error, the lower the ICC. Issues
related to point-by-point agreement are the focus of this chapter and the
next; issues related to summary agreement are considered in Chapter 7.
Figure 5.1. A kappa table tallying the frequency of agreements and disagreements by
two observers coding infant state for 120 intervals; p represents the marginal prob-
abilities and the lower-right .69 indicates the probability of observed agreement.
\[ \kappa = \frac{P_o - P_c}{1 - P_c}, \quad \text{with } P_o = \sum_{i=1}^{K} p_{ii} \quad \text{and} \quad P_c = \sum_{i=1}^{K} p_{+i}\, p_{i+} \]
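In code, the computation is a few lines; a sketch using plain Python lists, demonstrated on the Alert-versus-Other table of Figure 5.2 below:

def cohen_kappa(table):
    """Cohen's kappa from a K x K table of agreement tallies."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_o = sum(table[i][i] for i in range(k)) / n
    p_row = [sum(table[i]) / n for i in range(k)]                       # p_i+
    p_col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]  # p_+j
    p_c = sum(p_row[i] * p_col[i] for i in range(k))
    return (p_o - p_c) / (1 - p_c)

# The Alert-versus-Other 2 x 2 table from Figure 5.2:
print(round(cohen_kappa([[19, 3], [8, 90]]), 2))  # -> 0.72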
        Alert  Other  Total             Cry  Other  Total
Alert     19      3     22     Cry       20      8     28
Other      8     90     98     Other     10     82     92
Total     27     93    120     Total     30     90    120

        Fussy  Other  Total             REM  Other  Total
Fussy     11     10     21     REM       18     10     28
Other      6     93     99     Other      7     85     92
Total     17    103    120     Total     25     95    120

        Sleep  Other  Total
Figure 5.2. The five 2×2 tables produced by collapsing the 5×5 table in Figure 5.1
into a single Other category.
Kappas for the five separate tables are given in
Figure 5.2 and indicate that, for this example, agreement was best for Alert,
worst for Fussy. However, observers did not in fact make binary decisions;
thus the collapsed 2×2 tables do not necessarily reflect the agreement that
would result had a binary coding scheme been applied. When collapsing
any kappa table into 2×2 tables in this way, you can be sure that some of
the kappas for the 2×2 tables will be greater and some less than the kappa
for the parent K×K table. This is because kappa is a weighted average of the
kappas for the K separate 2×2 tables (Fleiss, 1981). If you multiply each 2×2
kappa by its denominator (i.e., weight each kappa by its 1 − Pc), sum these
products, and then divide by the sum of the weights, the result will be the
kappa computed for the parent table.
Note that partial-interval sampling (see “Partial-interval or one-zero
sampling” in Chapter 3) also requires a 2×2 table approach. With partial-
interval sampling, an interval can be coded for more than one code from a
ME&E set, yet the kappa computation requires that each interval contribute
only one tally to the kappa table. In this case, define 2×2 tables, one for each
code; then tally for each code separately and compute kappas separately for
each table.
Two questions remain. First, is the value of kappa sufficient to satisfy us and
others that our observers agree – if not perfectly, at least well enough? And second,
what standards should we apply to the values of kappa we compute?
\[ \kappa_{max} = \frac{P_{max} - P_c}{1 - P_c}, \quad \text{where } P_{max} = \sum_{i=1}^{K} \min(p_{+i}, p_{i+}) \]
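The maximum attainable kappa follows the same pattern; a sketch (same table representation as the kappa sketch above):

def kappa_max(table):
    """Maximum kappa attainable given the observers' marginals."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_row = [sum(table[i]) / n for i in range(k)]
    p_col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_max = sum(min(p_row[i], p_col[i]) for i in range(k))
    p_c = sum(p_row[i] * p_col[i] for i in range(k))
    return (p_max - p_c) / (1 - p_c)

# For the Alert-versus-Other table the marginals differ (22 vs. 27),
# so kappa cannot reach 1:
print(round(kappa_max([[19, 3], [8, 90]]), 2))  # -> 0.87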
[Figure 5.3 plot: kappa (.00 to 1.00) on the y-axis, number of codes (1–10) on the
x-axis, with separate lines for equiprobable, moderately variable, and highly
variable code prevalences.]
Figure 5.3. Expected values for kappa when number of codes and their prevalence
varies as shown for observers who are 95% accurate (top set of lines), 90% accurate
(2nd set), 85% accurate (3rd set), and 80% accurate (bottom set).
variable are .30, .42, .57, and .76; and when prevalence is equiprobable
are .36, .49, .64, and .81 (for other values, see Appendix A).
Our computations make the simplifying assumptions that both observers
were equally accurate and unbiased, that codes were detected with equal
accuracy, that disagreements were equally likely, and that when prevalence
varied it did so with evenly graduated probabilities. To the extent these
assumptions seem reasonable, even when not met perfectly in practice, the
computed values should provide reasonable estimates for expected values
of kappa.
The low values of kappa expected with reasonable observer accuracy, but
with only two codes, may surprise some readers (but probably not investi-
gators actively involved in observer training). They certainly give encour-
agement to observers in training who have been told that an acceptable
kappa must be at least .60 (or some other arbitrary value). Note also that it
is the effective K – not the actual K – that may be at issue. If by chance two
observers were asked to code an infant session during which the infant was
asleep, the 5-category infant state scheme becomes in effect a 2-category
system (REM and Sleep), for which a kappa of .44 would suggest observer
accuracy of 90 percent (assuming REM occurred no less than 12.5 percent –
our definition of highly variable when K = 2).
As Figure 5.3 shows, values of expected kappa increase with the num-
ber of codes, other things being equal. Consequently, it is puzzling that
the opposite has been claimed, and it is instructive to understand why.
Sim and Wright (2005), citing Maclure and Willett (1987), wrote that “The
larger the number of scale categories, the greater the potential for disagree-
ment, with the result that unweighted kappa will be lower with many cat-
egories than with few” (p. 264). Maclure and Willett presented a 12×12
kappa table tallying agreements and disagreements for ordinal codes. They
then collapsed adjacent rows and columns, producing first a 6×6, then a
4×4, a 3×3, and a 2×2 table. As expected with ordinal codes, disagreements
were not randomly distributed in off-diagonal cells but clustered more
around the diagonal and became less likely in lower-left and upper-right
cells. Not surprisingly, kappas computed for this series of collapsed tables
increased (values were .10, .23, .33, .38, and .55 for the range of tables from
12×12 to 2×2, respectively). Maclure and Willett wrote that “Clearly, in
this situation, the values for Kappa are so greatly influenced by the num-
ber of categories that a four-category-Kappa for ordinal data cannot be
compared with a three-category Kappa” (p. 163). Note that Maclure and
Willett did not claim that kappa would be lower with more codes gener-
ally, but only in the situation where ordinal codes are collapsed. In terms
and coordination (e.g., the National Institute of Child Health and Human
Development’s Study of Early Child Care, https://secc.rti.org). The poten-
tial reward can be a coherent, cumulative contribution to knowledge. From
this point of view, comparing fallible observers is the more common case,
because our research endeavors rarely represent sustained, coordinated
group efforts.
Figure 5.4. Sensitivity-specificity table. Cell d can be tallied only for interval
recorded data.
summary
Observer accuracy is rightly called the sine qua non of observational
research, and there are at least three major reasons why it is essential: first,
to assure ourselves that the coders we train are performing as expected;
second, to provide coders with the accurate feedback they need to improve
(and ourselves with information that may lead us to modify coding
schemes); and third, to convince others, including our colleagues and jour-
nal editors, that they have good reason to take our results seriously.
To assess observer accuracy, usually two observers are compared with
each other, but an observer’s coding could also be compared with a gold-
standard protocol that is presumed accurate. Gold standards take time to
prepare and confirm, but have advantages when coding spans considerable
time or different research teams and venues.
In either case, agreement is of two kinds. Point-by-point agreement
focuses on whether observers agree with respect to the successive intervals
or events coded (or rated), assumes nominal (or ordinal) measurement,
and primarily relies on some form of the kappa statistic. Point-by-point
agreement is especially useful for observer training prior to data collection
and for ongoing checking once data collection begins. In contrast, summary
agreement focuses on whether corresponding summary statistics agree. It
assumes at least ordinal or, more typically, interval or ratio scale measure-
ment and primarily relies on some form of the intraclass correlation coef-
ficient (ICC). Summary agreement is especially useful when a project has
moved from data collection to analysis and when the reliability of particular
summary scores is at issue. Point-by-point agreement may be sufficient; it is
often accepted as evidence that summary measures derived from sequential
data will be reliable – probably because point-by-point agreement seems
the more stringent approach.
The statistic most commonly used for point-by-point agreement is
Cohen’s kappa (Cohen, 1960) or some variant of it. Cohen’s kappa is a
summary statistic that assesses how well two observers agree when asked
to independently assign one of K codes from a ME&E set to N discrete
entities. The N observer decisions are tallied in a K×K contingency table,
called a kappa table (also, agreement or confusion matrix). Cohen’s kappa
corrects for agreement due to chance – which makes it preferable to per-
centage agreement, which does not. Values from −1 to +1 are possible;
positive values indicate better-than-chance agreement, near-zero values
indicate near-chance agreement, and negative values indicate worse-than-
chance disagreement.
Factors that affect values of kappa include observer accuracy and the
number of codes (the two most important), as well as codes’ individual
population prevalences and observer bias (how observers distribute individ-
ual codes). The maximum value of kappa is limited by observer bias; kappa
can equal 1 only when observers distribute codes equally. There is no one
value of kappa that can be regarded as universally acceptable; it depends on
the level of observer accuracy you want and the number of codes (i.e., num-
ber of alternatives among which observers select). Tables in Appendixes A
and B provide expected values of kappa for different numbers of codes and
varying observer accuracy.
6 Kappas for Point-by-Point Agreement
Figure 6.1. Sequential data types and the appropriate kappa variant for each. To
apply the event alignment algorithm (see text for details) to multicoded events,
co-occurring events must first be recoded into single ones.
data (untimed or timed) only when previously demarcated events are pre-
sented to coders€ – for example, as turns of talk in a transcript. Usually,
however, events are not “prepackaged.” Instead, as noted in the previous
paragraph, observers first are asked to segment the stream of behavior into
events and only then to code those events. The two observers’ records fre-
quently contain different numbers of events due to commission-omission
errors – one observer claims an event, the other does not. But even if
commission-omission errors are absent, exactly how the events align is
not always certain. And when alignment is uncertain, how to pair and tally
events in the kappa table is unclear.
This is a long-standing problem. Bakeman and Gottman (1997) wrote that,
especially when agreement is not high, alignment is difficult and requires
subjective judgment. However, we have now developed an algorithm that
determines the optimal global alignment between two single-code event
sequences without subjective judgment (Quera, Bakeman, & Gnisci, 2007).
The problem is not unique to behavioral observation. In fact, an algorithm
that provides the optimal matching or alignment between two sequences
was developed independently by several researchers from different fields
during the 1970s (Sankoff & Kruskal, 1999) and has been re-invented sub-
sequently (e.g., Mannila & Ronkainen, 1997). Molecular biologists know
it as the Needleman-Wunsch algorithm and use it routinely for genetic
sequence alignment and comparison (Needleman & Wunsch, 1970; see also
Durbin, Eddy, Krogh, & Mitchison, 1998, and Sankoff & Kruskal, 1999).
The Needleman-Wunsch algorithm, on which our alignment algorithm is
based, belongs to a broad class of methods known as dynamic programming.
With these methods, the solution for a specific subproblem can be derived
        Nil  Alert  Cry  Fussy  Sleep  Total
Nil      —      1    2      0      0      3
Alert    0      5    0      0      0      5
Cry      0      0    1      1      0      2
Fussy    0      0    1      4      0      5
Sleep    1      0    0      0      4      5
Total    1      6    4      5      4     20
Event sequences:
S1 = ASFSFASCFAFSCAFSA
S2 = ASCFSCASCFCAFASFAFA
Event alignment:
AS-FSFASCF-AF-SCAFSA
|| ||:|||| || |:|| |
ASCFSCASCFCAFASFAF-A
Figure 6.2. Two single-code event sequences, their alignment per the dynamic
programming algorithm as implemented in GSEQ, and the kappa table resulting
from tallying agreement between successive pairs of aligned events. For this
example, alignment kappa = .62. See text for other details.
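For readers who want to see the dynamic-programming machinery in miniature, here is a sketch of Needleman-Wunsch global alignment applied to the Figure 6.2 sequences. The scoring values and tie-breaking are illustrative assumptions, so it may print a different but equally optimal alignment than GSEQ does:

def align(s1, s2, match=1, mismatch=0, gap=0):
    """Needleman-Wunsch global alignment of two code sequences,
    maximizing the number of agreements (matches)."""
    n, m = len(s1), len(s2)
    # score[i][j] = best score aligning s1[:i] with s2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,      # gap in s2
                              score[i][j - 1] + gap)      # gap in s1
    a1, a2, i, j = [], [], n, m   # trace back one optimal path
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return "".join(reversed(a1)), "".join(reversed(a2))

# The two sequences from Figure 6.2:
for row in align("ASFSFASCFAFSCAFSA", "ASCFSCASCFCAFASFAFA"):
    print(row)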
Figure 6.3. Two timed-event 20-minute sequences (in this case, state sequences)
with durations in seconds, and the kappa table resulting from tallying agreement
between successive pairs of seconds with no tolerance. For this example, time-unit
kappa = .70. See text for other details.
section, and has the merit of bringing the number of tallies more in line
with the number of decisions, but usually results in lower values for kappa.
As with single-code event sequential data, first events would need to be
aligned€– but taking time into account. Compared to tallying time units,
tallying agreements and disagreements between aligned events probably
underestimates the number of decisions observers actually make, but at
least the number of tallies is closer to the number of events coded than for
time-unit kappa.
To effect the required alignment, we modified our single-code event
alignment algorithm to take onset and duration times into account so
that it would work with timed-event sequential data (Bakeman, Quera, &
Gnisci, 2009). The modified algorithm requires that you specify values for
two parameters. The first is an onset tolerance – events with onset times
that differ by no more than this amount are aligned; thus even identically
coded events whose onsets differ by more than this amount can generate
commission-omission errors. The second is a percent overlap – events that
overlap by this percent are regarded as agreements if identically coded and
regarded as disagreements if differently coded; but an event coded by one
observer that does not overlap by this percent with an event coded by the
other observer is regarded as an event the second observer missed.
Once aligned, events are tallied in a kappa table and kappa is computed
in the usual way (but using iterative proportional fitting for expected fre-
quencies due to the nil-nil structural zero). The result is a timed-event align-
ment kappa. Results for the data given in Figure 6.3 are shown in Figure 6.4,
along with a time plot. For the event plot, vertical bars indicate exact agree-
ment, two dots (colon) indicate disagreements, and hyphens indicate events
coded by one observer but not the other. For the time plot (the last 600
seconds of the Figure 6.3 data are shown), horizontal segments indicate
event durations, solid vertical lines between events indicate agreements,
dotted lines between events indicate disagreements, and dotted lines to top
or bottom indicate commission-omission errors. In this case, the number
of events aligned was the same as for the event alignment (see Figure 6.2),
but the alignment was somewhat different. Due to differences in onset
times, Observer 2’s Fussy was regarded as an omission error on the part
of Observer 1, and thus Observer 1’s subsequent Fussy was paired with
Observer 2’s Cry, which counted as a disagreement.
With different data, however, or with different values for the tolerance
and overlap parameters, different alignments could have resulted – perhaps
with more errors and lower kappas. As a general rule, unless two obser-
vers’ timed events align quite well, timed-event alignment kappas will be
        Nil  Alert  Cry  Fussy  Sleep  Total
Nil      —      1    1      1      0      3
Alert    0      5    0      0      0      5
Cry      0      0    1      1      0      2
Fussy    0      0    2      3      0      5
Sleep    1      0    0      0      4      5
Total    1      6    4      5      4     20
Timed-event alignment:
AS-FSFASC-FAF-SCAFSA
|| ||:||| :|| |:|| |
ASCFSCASCFCAFASFAF-A
[Time plot: one panel per observer, rows labeled Alert, Cry, Fussy, and Sleep;
see caption.]
Figure 6.4. Alignment of the two timed-event sequences shown in Figure 6.3 per
the dynamic programming algorithm as implemented in GSEQ (with 10-second
tolerance for onset times and 80% overlap for agreements-disagreements), and the
kappa table resulting from tallying agreement between successive pairs of aligned
events. For this example, timed-event alignment kappa = .56. See text for other
details.
lower than single-code alignment kappas, because when time is taken into
account, more commission-omission errors and more disagreements typ-
ically result. For this example, the alignment differed somewhat and the
value of the timed-event alignment kappa was .56 compared to .62 for the
single-code alignment kappa.
An alternative to the time-based alignment algorithm for timed-event
sequential data presented here is one proposed by Haccou and Meelis
(1992). It consists of a cascade of rules for alignment instead of the more
mathematically based approach to alignment of the Needleman-Wunsch
algorithm. It has been implemented in at least two commercially available
\[ \kappa_{wt} = 1 - \frac{\sum_{i=1}^{K} \sum_{j=1}^{K} w_{ij} x_{ij}}{\sum_{i=1}^{K} \sum_{j=1}^{K} w_{ij} e_{ij}} \]
where wij, xij, and eij are elements (i-th row, j-th column) of the weight,
observed, and expected matrices, respectively; eij = p+j xi+, where xi+ is the
sum for the i-th row and p+j is the probability for the j-th column (and
where p+j = x+j/N).
Any use of weighted kappa requires that you can convince others that the
weights you assign are sound and not unduly arbitrary. One set of weights
requires little rationale. If you weight all disagreements equally as 1, then
weighted kappa will have the same value as unweighted kappa. Otherwise,
if you weight different disagreements differently, be ready with convincing
reasons for your different weights.
In contrast to nominal codes, weights for ordinal ratings (or codes that
can be ordered) are much easier to justify. It makes sense that disagreements
between codes or ratings farther apart should be weighted more heav-
ily. The usual choice is either linear weights or, if you want disagreements
further apart treated even more severely, quadratic weights. Specifically,
wij = |Ci − Cj| for linear and wij = (Ci − Cj)² for quadratic weights, where Ci and
Cj are ordinal numbers for the i-th row and j-th column, respectively, and
wij is the weight for the cell in the i-th row and j-th column (see Figure 6.5).
For the Figure 5.1 kappa table, values for unweighted kappa, weighted kappa
with linear weights, and weighted kappa with quadratic weights (shown to
five significant digits) are .61284, .62765, and .64379, respectively.
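A sketch of the computation with linear or quadratic disagreement weights, mirroring the formula above (the example table is hypothetical):

def weighted_kappa(table, power=1):
    """Weighted kappa with disagreement weights w_ij = |i - j|**power
    (power=1 for linear, power=2 for quadratic weights)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    x_row = [sum(table[i]) for i in range(k)]                           # x_i+
    p_col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]  # p_+j
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) ** power
            num += w * table[i][j]
            den += w * p_col[j] * x_row[i]   # e_ij = p_+j * x_i+
    return 1 - num / den

# A hypothetical 3 x 3 table for ordinal ratings (illustration only):
t = [[20, 5, 0], [4, 15, 6], [1, 3, 16]]
print(round(weighted_kappa(t, power=1), 3))  # linear    -> 0.675
print(round(weighted_kappa(t, power=2), 3))  # quadratic -> 0.76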
As Cohen appreciated (1968; also Fleiss & Cohen, 1973), when quad-
ratic weights are used and there is no observer bias, weighted kappa and
the intraclass correlation coefficient (ICC, as discussed in the next chapter)
have the same value. Perhaps for this reason, the statistically inclined find
quadratic weights “intuitively appealing” (Maclure & Willett, 1987, p. 164),
Figure 6.5. Two sets of weights for computing weighted kappa given four ordered
codes. Linear weights are on the left and quadratic weights on the right.
but either quadratic or linear weights could be used; and when disagree-
ments in fact cluster around the diagonal, values for weighted kappa com-
pared to unweighted kappa will be higher. More to the point, with ordinal
codes (and rating scales), weighted kappa may be a more accurate reflec-
tion of observer agreement than unweighted kappa because it treats dis-
agreements between ordinal codes farther apart as more severe than those
between codes closer together.
Point-by-point agreement is especially useful for observer training prior to data collection and for
ongoing checking of observers once data collection begins. For these pur-
poses, the kappa table is more useful than the value of kappa; the kappa stat-
istic, by reducing the information in the table to a single number, obscures
the sources of agreement and disagreement identified in the table.
Presumably, observers want to perform well and get better. Presenting
them with a single value of kappa does not help much, but examining a
kappa table can. It is also a useful exercise for investigators. For example,
when the observers’ marginal probabilities for a given code differ, we know
that one observer overgeneralizes, and so overuses, that code relative to the
other observer – which means that we need to spend more time sharpening
the definition of such codes and working with our observers to assure they
share a common understanding. Further, as we work with our observers to
understand why some off-diagonal cells of a kappa table contain many tal-
lies and others do not, we identify codes that are causing confusion, defini-
tions that require greater precision, concepts that need to be defined with
greater clarity, and possibly even codes that should be lumped or eliminated
from the coding scheme. Only by looking beyond kappa to its table do we
unlock its usefulness. Kappa can be used to mark progress in the course of
observer training; working with observers to understand discrepancies in
the kappa table can facilitate that progress.
From this point of view, the answer to the question – Given timed-event
sequential data, which is better, a time-unit kappa or a timed-event align-
ment kappa? – is both. Both are better because, first, their range likely cap-
tures the “true” value of kappa; but, second and more importantly, the two
kappa tables provide different but valuable information about agreements
and disagreements. The time-unit kappa table emphasizes how long agree-
ments and disagreements lasted, whereas the timed-event alignment kappa
table emphasizes agreements and disagreements with respect to the onsets
of events. A thoughtful examination of both tables can only help observers
as they strive to improve their accuracy.
summary
The classic Cohen’s kappa works well with single-code event data when
events are previously demarcated and also works well with interval sequen-
tial data because in both cases the entities to which codes are assigned are
identified before coding begins. However, when observers are asked to first
segment the stream of behavior into events (i.e., detect the seams between
events) and only then code those events, agreement is more complicated.
Frequently the two observers’ records contain different numbers of events
due to commission-omission errors – one observer claims an event, the
other does not – but even when the records contain the same number of
events, exactly how the events align is not always certain. And when align-
ment is uncertain, how events should be paired and tallied in the kappa
table is unclear.
To solve this problem, we developed an algorithm based on the Needleman-
Wunsch algorithm used by molecular biologists for genetic sequence align-
ment and comparison. It can be demonstrated that the method guarantees
an optimal solution – that is, it finds the alignment with the highest pos-
sible number of agreements between sequences. Another advantage is that
the algorithm identifies commission-omission errors. Once aligned, paired
events can be tallied in a kappa table, and what we call an alignment kappa
can be computed for single-code event sequential data.
For timed-event sequential data, one possibility is to tally successive pairs
of time units as defined by the precision with which times were recorded
(recall the code–time-unit grid representation) and compute what we call a
time-unit kappa. Another possibility is to code as agreements any time units
coded similarly within some stated tolerance – which results in what we call
a time-unit kappa with tolerance. A possible concern is that with the clas-
sic Cohen model, the number of tallies represents the number of decisions
coders make, but with time-unit kappa, the number of tallies reflects the
length of the session. With timed-event recording, observers are continu-
ously looking for the seams between events, but how often they are making
decisions is arguable; one decision per seam seems too few, but one per time
unit seems far too many. To address this concern, we adapted our event
alignment algorithm for use with timed-event data. With it, timed events
can be aligned, and what we call a timed-event alignment kappa can be com-
puted. Then both time-unit kappa and timed-event alignment kappa can be
reported for timed-event data.
When some disagreements are regarded as more serious than others,
weighted kappa may be useful; this computational variant allows the user to
provide different weights for each possible disagreement. When codes are
ordered or represent ratings, disagreements between codes farther from the
diagonal may be assigned higher weights. In such cases, arguably weighted
kappa may be a more accurate reflection of observer agreement.
The value of kappa may be overemphasized. Especially for observer
training, the kappa table and its marginal probabilities are more useful than
7
Figure 7.1. Summary contingency indices for ten targets (sessions) derived from
data coded by two observers, their analysis of variance statistics, and the formulas
and computations for ICCrel and ICCabs, respectively; k = number of observers and
n = number of targets.
summary
Once data collection is complete and summary scores have been computed
from sequential data (e.g., rates and proportions for individual codes,
contingency table indices involving two or more codes), reliability of par-
ticular summary scores can be assessed with an intraclass correlation coef-
ficient (ICC). Computation of an ICC requires a reliability sample – paired
values for a particular summary statistic derived for several targets (often
sessions) – with scores computed from data coded independently by two or
more observers. ICCs come in several forms; of the two relevant for obser-
ver reliability, one assesses relative consistency and one absolute agreement.
When the entire corpus of a study is coded by one observer, it may make
sense to use the relative consistency ICC when establishing reliability; but
when several observers are employed, each coding a part of the corpus, it
may make more sense to use the absolute-agreement ICC instead. In either
case, a rationale is required.
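For readers who want the arithmetic behind the two forms, here is a compact sketch computed from a reliability sample via the usual two-way ANOVA decomposition (the formulas follow McGraw and Wong's single-measure conventions; the sample data are hypothetical):

def iccs(scores):
    """Relative-consistency and absolute-agreement single-measure ICCs
    from a two-way layout: one row per target, one column per observer."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_m = [sum(r) / k for r in scores]
    col_m = [sum(r[j] for r in scores) / n for j in range(k)]
    ms_r = k * sum((m - grand) ** 2 for m in row_m) / (n - 1)   # targets
    ms_c = n * sum((m - grand) ** 2 for m in col_m) / (k - 1)   # observers
    ms_e = sum((scores[i][j] - row_m[i] - col_m[j] + grand) ** 2
               for i in range(n) for j in range(k)) / ((n - 1) * (k - 1))
    icc_rel = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)
    icc_abs = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
    return icc_rel, icc_abs

# Hypothetical reliability sample: five targets, two observers.
rel, agr = iccs([[4.2, 4.0], [2.1, 2.6], [3.3, 3.1], [5.0, 4.7], [1.8, 2.0]])
print(round(rel, 2), round(agr, 2))  # -> 0.96 0.97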
8 Summary Statistics for Individual Codes
After data collection is complete, and before data are analyzed, many
measurement methods require intervening steps – some sort of data
reduction – even if it is only computing a score from the items of a self-
report measure. Behavioral observation, however, seems to require more
data reduction than most measurement methods. Rarely are the coded
data analyzed directly without intervening steps. First, summary scores
of various sorts are computed from the event, timed-event, and inter-
val sequential data produced by the coders. In other words, the data-as-
collected, which usually reflect categorical measurement, are transformed
into summary scores for which ratio-scale measurement can usually
be assumed.
As with scores generally, the first analytic steps for summary scores
derived from behavioral observation involve quantitative description.
Descriptive results for individual variables (e.g., means, medians, and
standard deviations, as well as skewness and distribution generally) are
important – first, of course, for what they tell us about the behavior we
observed, but also because they may define and limit subsequent analyses.
Limited values or skewed distributions, for example, may argue against
analysis of variance or other parametric statistical techniques. But what
summary scores should be derived and described first?
When taking these first descriptive steps, it is useful to distinguish
between simple statistics for individual codes that do not take sequencing
or contingency into account (described in this chapter) and contingency
and other table statistics involving two or more codes that do (described in
the next chapter). It makes sense to describe statistics for individual codes
first – if their values are not appropriate, computation of some contin-
gency and other table statistics may be precluded or at best be questionable.
Statistics that characterize individual codes are relatively few in number,
Figure 8.1. An SDIS timed-event data file with 1-second precision (top) and
an SDIS interval data file with 1-second intervals (bottom) describing the same
events.
although their interpretation can vary depending on the data type (event,
timed-event, and interval, where event includes single-code and multicode
variants). In this chapter we describe these simple statistics, note how data
type affects their interpretation, and recommend which are most useful for
each type of sequential data.
We illustrate these simple statistics with two brief data files that describe
exactly the same events (see Figure 8.1). The codes are modeled on ones
used to describe children undergoing potentially painful medical proce-
dures as in Chorney, Garcia, Berlin, Bakeman, and Kain (2010). Calm, Cry,
and Fuss form a ME&E set used to describe a child’s behavior; and Assure,
Explain, and Touch are three behaviors that nurses or other medical pro-
fessionals might use in attempts to quiet an upset child. The timed-event
file (top) is formatted for timed-event data with 1-second precision, and
the interval file (bottom) is formatted for interval data with 1-second
intervals. Both files code 60 seconds of behavior as shown graphically in
Figure 8.2. Together they illustrate the point that, when the precision used
for timed-event data is the same as the interval duration for interval data, the
code-unit grid representation is the same for both data types (in fact, rarely
would a 1-second interval be used for interval recording). One caveat: this
60-second example is not based on actual data; there is no reason to think
that its sequences or durations reflect real results. The two example data
files are here simply to illustrate statistical definitions and computations
[Grid: rows labeled by code; columns numbered 1–30 for successive seconds or
intervals.]
Figure 8.2. A code-unit grid for the timed-event data (60 seconds) and the interval
data (60 intervals) shown in Figure 8.1.
and to show that these definitions produce identical results for the timed-
event and interval data files shown in Figure 8.1.
Frequency
Frequency indicates how often. It is not adjusted for sessions that vary in
length. In contrast, rate (defined shortly) is the frequency for a specified
time interval. In defining frequency and rate this way, we are following
standard statistical and behavioral usage. Some other fields (e.g., math-
ematics and physics) define frequency as the number of times a specified
phenomenon occurs within a specified interval; thus where we say fre-
quency and rate, they would say occurrence and frequency. (Note: Martin
and Bateson, 2007, like physicists, define frequency as the number of
occurrences per unit time, but acknowledge that defining frequency as the
Relative frequency
Relative frequency indicates proportionate use of codes and is defined rela-
tive to a specified set of codes.
For all data types, relative frequency is a code’s frequency, as just
defined, divided by the sum of frequencies for the codes specified – hence
the relative frequencies computed for a specified set necessarily sum to 1.
(Alternatively, relative frequencies can be expressed as percentages sum-
ming to 100 percent.) For example, when Calm, Cry, and Fuss are specified,
their relative frequencies for the Figure 8.1 data are 33 percent, 25 percent,
and 42 percent, respectively. Some research questions require relative fre-
quency and some do not, and occasionally investigators report relative fre-
quency when another statistic is needed. Consequently, you should provide
an explicit rationale when using this statistic, along with the set of codes
used to compute it.
Rate
Rate, like frequency, indicates how often. It is the frequency per a specified
amount of time. Because it can be compared across sessions, rate is pref-
erable to frequency when sessions vary in length. (As noted, this is how
Martin and Bateson, 2007 – and mathematicians and physicists generally –
define frequency.)
For all data types, rate is computed by dividing frequency, as just defined,
by the duration of the session. Its units may be adjusted (i.e., multiplied by
the appropriate factor) to be per minute, per hour, or per any other time
unit that makes sense. Rate cannot be computed unless the session duration
is known. For event data (both single-code and multicode), duration is
session stop time minus session start time. For timed-event data, duration
can be derived directly from the data. For interval data, a code’s duration is
the number of intervals multiplied by the interval duration. For example,
because 1-second intervals were used for the data shown in Figure 8.1, rates
for Calm, Cry, and Fuss were 4, 3, and 5 per minute, respectively. With the
same frequencies and 5-second intervals, rates (i.e., intervals checked per
unit time) would be computed as 0.80, 0.60, and 1.00 per minute or 24, 18,
and 30 per half hour.
Duration
Duration indicates how long or how many. Like frequency, it is not adjusted
for sessions that vary in length. For single-code event data, duration is the
same as frequency. (Note: Martin and Bateson, 2007, define duration differ-
ently. Their definition for duration is the length of time for which a single
occurrence of the behavior pattern lasts – what we call duration they call
total duration.)
For timed-event data, duration indicates the amount of time devoted to
a particular code during the session. In contrast, for interval and multicode
event data, duration indicates the number of intervals or the number of
multicoded events checked for a particular code (i.e., the units are intervals
or events, not standard time units like seconds). For example, for the inter-
val data shown in Figure 8.1, durations for Calm, Cry, and Fuss were 16, 19,
and 25 intervals.
Further – and this is something of a technical digression – an estimated
duration that takes interval duration (symbolized as w for width) and sam-
pling method into account can be computed from interval data by applying
a correction suggested by Suen and Ary (1989). Using the definitions for
code duration (d) and frequency (f) given in the preceding paragraphs,
estimates for momentary, partial, and whole-interval sampling (see “Interval
recording” in Chapter 3) will then be wd, w(d − f), and w(d + f), respect-
ively. Both the current and earlier versions of GSEQ compute this estimated
duration, but it has rarely been reported in the published literature. Usually
the number of checked intervals is reported as the duration, but it is then
interpreted in light of the interval duration and sampling method.
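The correction itself is one line per sampling method; a sketch (argument names are ours):

def estimated_duration(w, d, f, sampling):
    """Estimated duration from interval data per Suen and Ary (1989):
    w = interval width, d = intervals checked, f = episode count."""
    return {"momentary": w * d,
            "partial":   w * (d - f),
            "whole":     w * (d + f)}[sampling]

# E.g., Cry in the Figure 8.1 data: 19 one-second intervals checked
# across 3 episodes, recorded with partial-interval sampling:
print(estimated_duration(1, 19, 3, "partial"))  # -> 16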
Relative duration
Relative duration indicates proportionate use of time for timed-event data
and proportionate use of intervals or events for interval and multicode
event data. Like relative frequency, it is defined relative to a specified set of
codes. For single-code event data, relative duration is the same as relative
frequency.
For timed-event, interval, and multicode event data, relative duration is
a code’s duration, as just defined, divided by the sum of the durations for
the codes specified – hence the relative durations computed for a specified
set necessarily sum to 1. (Alternatively, relative durations can be expressed
as percentages summing to 100 percent.) To be interpretable, the specified
codes should be at least mutually exclusive, if not ME&E. For example, when
the ME&E Calm, Cry, and Fuss are specified, their relative durations for
the Figure 8.1 data are 27 percent, 32 percent, and 42 percent, respectively.
Because these codes are ME&E, we know that the child was fussing for 42
percent of the session (or was coded fussing for 42 percent of the intervals).
When the codes specified are mutually exclusive, but not exhaust-
ive, the interpretation is somewhat different. If a mutually exclusive – but
not exhaustive – set of codes for mother vocalizations including Naming
were defined, we might discover, for example, that 24 percent of the time
when mother’s vocalizations were occurring they were coded Naming. But
Naming was not coded for 24 percent of the total session because the codes
were not exhaustive. As with relative frequency, to avoid possible misuse,
you should provide an explicit rationale when using this statistic.
Probability
Probability indicates likelihood and can be expressed as either a proportion
or a percentage. Just as rate is preferable to frequency, so too probability
is preferable to duration when sessions vary in length, and for the same
reason – both rate and probability can be compared across sessions. For
single-code event data, probability is the same as relative frequency.
Probability, as we define it here, indicates the proportion or percentage
of a session devoted to a particular code (as with other statistics defined
here, we treat it as a sample statistic, not an estimate of a population par-
ameter as statisticians might). For timed-event data, it is the code’s dur-
ation divided by the session’s duration; and for interval and multicode
event data, it is the code’s duration as just defined divided by the total
number of intervals or multicoded events. For example, probabilities for
Figure 8.3. Formulas for six basic simple statistics by data type. See text for
explanation and details.
Calm, Cry, and Fuss for the Figure 8.1 data are 27 percent, 32 percent, and
42 percent, respectively. These are the same values as for relative duration,
but only because these three codes are ME&E. If additional codes were
specified for the probability calculation, the probability values would stay
the same, but values for relative duration would change (as would their
interpretation).
Figure 8.3 summarizes these six statistics and shows how they are com-
puted for each data type. For interval and multicode event data: (1) E, and
not F, is used for frequency to remind us that, when tallying their fre-
quency, episodes are tallied – that is, uninterrupted stretches of intervals or
multicoded events checked for the same code; (2) duration is the number of
intervals or multicoded events checked for a given code; and (3) probability
is computed relative to the total number of intervals or multicoded events.
Rate, however, is computed relative to total duration, which is inferred from
the data for timed-event data, from explicit start and stop times for event
data (both single-code and multicode), or from the number of intervals
multiplied by the interval duration for interval data.
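As a sketch, the Figure 8.3 formulas for timed-event data reduce to a few lines (variable names are ours; the demonstration values are Calm's, from the Figure 8.1 data):

def basic_stats(f, d, T, set_f, set_d):
    """Six basic statistics for one code (timed-event data): f = its
    frequency, d = its duration, T = session duration in seconds;
    set_f and set_d sum frequency and duration over the specified set."""
    return {"frequency":          f,
            "relative frequency": f / set_f,
            "rate per minute":    f / T * 60,
            "duration":           d,
            "relative duration":  d / set_d,
            "probability":        d / T}

# Calm in the Figure 8.1 data: f = 4, d = 16, T = 60; the specified
# ME&E set (Calm, Cry, Fuss) totals 12 events and 60 seconds.
for name, value in basic_stats(4, 16, 60, 12, 60).items():
    print(f"{name}: {value:.2f}")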
Mean Gap
Gaps are of two kinds: between consecutive occurrences of events – that is,
from one offset to the next onset of the same code; and between onsets of
those events – that is, from one onset to the next onset of the same code.
For timed-event data, mean gap indicates the average time between events
or between onsets of those events. When computed for interval or multi-
code event data, mean gap indicates the mean number of successive inter-
vals or multicoded events between ones checked for a particular code (i.e.,
the mean number between episodes) or between ones checked for that
code’s onsets.
Latency
For timed-event data, latency is defined as the time from the start of a ses-
sion to the first onset of a particular code (not as the time between a pair
of behaviors generally, a meaning that is also encountered in the literature).
For interval or multicode event data, it is the number of intervals or mul-
ticoded events from the start of the session to the first interval or event
checked for a particular code. If a session consists of more than one stream
(see “Timed-event and state sequences” in Chapter 4), latency is computed
as the mean of the latencies for the separate streams.
GSEQ can compute values for the six basic statistics described in the
previous section for whichever codes you specify. It can also compute
values for mean event duration, gap, and latency, as well as their minimum
and maximum values (GSEQ calls these simple statistics to distinguish them
from the table statistics described in the next chapter). For example, for the
Figure 8.1 data, the mean event duration for Calm is 4 (min = 3, max = 5), the
mean gap between times (or intervals) coded Calm is 10.7 (min = 9, max =
13), the mean gap between onsets of Calm is 14.3 (min = 12, max = 17),
and the latency for Calm is 9.
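These episode-level statistics are easy to compute from onset-offset pairs. The sketch below uses a hypothetical reconstruction of the Calm episodes, chosen to reproduce the statistics just quoted; Figure 8.1's actual times may differ:

# Hypothetical Calm episodes as (onset, exclusive offset) pairs,
# chosen to reproduce the statistics quoted in the text:
episodes = [(9, 12), (21, 25), (35, 39), (52, 57)]

durations = [off - on for on, off in episodes]
gaps = [episodes[i + 1][0] - episodes[i][1]
        for i in range(len(episodes) - 1)]        # offset to next onset
onset_gaps = [episodes[i + 1][0] - episodes[i][0]
              for i in range(len(episodes) - 1)]  # onset to next onset
latency = episodes[0][0]   # session start (time 0) to first onset

print(sum(durations) / len(durations))             # mean duration -> 4.0
print(round(sum(gaps) / len(gaps), 1))             # mean gap -> 10.7
print(round(sum(onset_gaps) / len(onset_gaps), 1)) # mean onset gap -> 14.3
print(latency)                                     # -> 9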
experienced one kind of event relative to other events matters (e.g., 32 per-
cent of the time the child was coded Cry). Whether you analyze one or
both, once again explicit rationales are clarifying and appreciated, and show
others that your decisions were well thought out.
With regard to mean event duration, consider the triad of mean event
duration, frequency (or rate), and duration (or probability). These three are
correlated because mean event duration is (total) duration divided by fre-
quency; in effect, the three yield two degrees of freedom. Due to this redun-
dancy, it does not make sense to analyze all three. Instead pick the two that
seem most meaningful in the context of your research questions – or if you
describe all three, be aware that analyses of them are not independent.
For interval and multicode event data, the options and choices are simi-
lar to those for timed-event data; however, because the underlying units
are intervals or events, and not time units, the interpretation is somewhat
different. For example, for the Figure 8.1 data either you would report that
4 episodes were coded Calm (successive intervals or multicoded events all
coded Calm) and they lasted for a total of 16 intervals or multicoded events,
or you would report that an episode was coded Calm 4 times per minute
and 27 percent of the intervals or multicoded events were coded Calm. You
could also report that 25 percent of the child’s episodes were coded Cry, but
32 percent of the intervals or multicoded events were coded Cry.
summary
Data reduction – computing summary scores from the coded data – is
an essential first step when analyzing observational data. Such summary
scores, which typically assume ratio-scale measurement, are important
both in themselves and because they may limit subsequent analyses. For
example, when the distribution for a summary score is skewed, some sort
of recode or transformation, use of nonparametric statistics, or both may be
indicated. Summary scores are of two kinds: those computed for individual
codes (as described in this chapter) and those computed from contingency
tables (as described in the next chapter). Statistics for individual code sum-
mary scores are described first because, if their values are not appropriate
(e.g., many zeros, excessively skewed, or both), computation of some con-
tingency and other table statistics may be questionable.
Basic statistics for individual codes include frequency, relative frequency,
rate, duration, relative duration, and probability. For single-code event data,
only frequency, relative frequency, and rate need be computed (because they
are the same as duration, relative duration, and probability, respectively).
Basic statistics for timed-event, interval, and multicode event data are simi-
lar, but their interpretation varies because the units are different (time units,
intervals, and multicoded events, respectively). Which statistics you choose
to report depends on your research questions. Essentially, frequency and
rate indicate how often, duration indicates how long (time units) or how
many (intervals or multicoded events), and probability indicates likelihood
(i.e., proportion or percentage).
Additional statistics useful for describing individual codes are mean
event duration, mean gap (between codes and between their onsets), and
latency – and minimum and maximum values for each. Whatever statis-
tics you select to characterize individual codes, explicit rationales for your
choices are clarifying and appreciated.
9
Cell and Summary Statistics
Figure 9.1. Definitions for five basic cell statistics and the notation used to describe them.
to the table; in other words, the total number of tallies equals the number
of units cross-classified.
The codes that define the rows (the given codes) of the R×C table must
constitute a mutually exclusive set – or at least be treated as though they
did by following the hierarchical tallying rule explained shortly. In most
cases, they must also be exhaustive (a narrowly targeted research question
that restricts the codes considered might be an exception). The codes that
define the columns (the target codes) must likewise be mutually exclusive
and usually exhaustive. Almost always with timed-event, interval, and mul-
ticode event data, columns are not lagged relative to rows. It is expected that
some row codes will co-occur with some column codes; indeed, such co-
occurrences, or their lack, are often central to our research questions. For
interval, multicode event, and especially timed-event data, lagged sequen-
tial associations are best analyzed by defining time windows anchored to
existing codes (see “Creating new codes as ‘windows’ anchored to existing
codes” in Chapter 10 and “Time-window sequential analysis of timed-event
data” in Chapter 11). There is one strategy we would not recommend: It
makes little sense to mindlessly include all codes as both givens and targets
just “to see what happens.” For one thing, codes will, of course, co-occur
with themselves. When selecting given and target codes, a thoughtful and
an explicit rationale is important.
Tallying proceeds as follows. Each successive unit is scanned for a given
(row) code. Given codes are considered in the order listed. If one is found,
scanning for given codes stops and scanning for target (column) codes
begins. Consequently, if a unit contains more than one given code, only
the one encountered first in the list is used to cross-classify it. Target codes
are likewise considered in the order listed. If one is found, scanning for
target codes stops and a tally is added to the table for that given-target
code pair. This is what we mean by hierarchical tallying – which, in effect,
makes any list of codes mutually exclusive. For each successive unit, if no
given code is found – or if a given code but no target code is found – no
tally is added to the table. Therefore, to ensure that the total number of
tallies equals the total number of units, both given and target codes must
be exhaustive – which in GSEQ may be accomplished with a residual code,
indicating anything-else or none-of-the-above, and signified with an &
(i.e., an ampersand).
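As a sketch of this logic (ours, not GSEQ's implementation), suppose each unit is represented as the set of codes it contains; the unit data here are hypothetical:

    givens = ['Calm', 'Cry', 'Fuss']          # row codes, in listed (priority) order
    targets = ['Assure', 'Explain', '&']      # column codes; & is the residual

    def first_match(code_list, unit):
        # Hierarchical rule: use the first listed code present in the unit;
        # the residual & matches any unit, making the list exhaustive.
        for code in code_list:
            if code == '&' or code in unit:
                return code
        return None

    units = [{'Calm', 'Explain'}, {'Fuss', 'Assure', 'Touch'}, {'Cry'}]  # hypothetical
    table = {(g, t): 0 for g in givens for t in targets}
    for unit in units:
        g, t = first_match(givens, unit), first_match(targets, unit)
        if g is not None and t is not None:
            table[(g, t)] += 1                # each unit adds one, and only one, tally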
To illustrate the computations in this chapter, we again use the data in
Figure 8.1. Assume that we tallied the time units or intervals for these data
in a 3×3 table whose rows were Calm, Cry, and Fuss (a ME&E set) and
whose columns were Assure, Explain, and a third column labeled &, which
makes the column codes exhaustive (here, by indicating any time unit or
interval in which neither Assure nor Explain was coded). The tallies for
the 60 time units or intervals are shown in Figure 9.2. For example, of 60
seconds (or intervals), only 1 was coded both Calm and Assure but 10 were
coded both Calm and Explain. (As a general rule, row and column codes
should be exhaustive so that all units are tallied. Occasionally, however, a
narrowly targeted research question may only require that a subset be tal-
lied – for example, if our only interest were in comparing Calm-Assure and
Calm-Explain associations.)
Figure 9.3. Observed Lag 1 counts and transitional probabilities for Figure 8.1 data
after being converted into single-code event data with Assure, Explain, and Touch
removed; codes cannot repeat. Dashes indicate structural zeros.
probability of Assure given Calm was .06, but of Assure given Fuss was .49.
Descriptively, it appears that Fuss and Assure were often associated, but
Calm and Assure seldom were.
For single-code event data, conditional probabilities are computed the
same way, but are called transitional probabilities because they indicate
transitions of some lag. For Lag 1, when codes cannot repeat, conditional
probabilities on the diagonal – like the observed joint frequencies from
which they are computed – are structural zeros. For example, as shown in
Figure 9.3, the transitional probability of Fuss at Lag 1 after Calm is .75, but
the simple probability of Fuss is .36.
Conditional probabilities reflect target behavior frequencies and so
are not comparable over sessions. For this reason, although it may seem
tempting to regard particular conditional probabilities as outcome scores
for individual sessions and to analyze them using standard statistical tech-
niques, this is not recommended. There are better 2×2 contingency index
alternatives, as discussed shortly.
compute other statistics – the most useful of which is the adjusted resid-
ual. The adjusted residual indicates the extent to which an observed joint
frequency differs from chance: It is positive if the observed is greater than
chance and negative if the observed is less than chance. If there is no asso-
ciation between given and target codes, then adjusted residuals are dis-
tributed approximately normally with a mean of 0 and variance equal to 1
(granted assumptions), and so their magnitudes can be compared across
various pairs of codes within the same contingency table – this is perhaps
their major merit (Allison & Liker, 1982; Haberman, 1978). For example,
Explain was more likely given Calm (z = 2.54) and Assure was more likely
given Fuss (z = 2.58), whereas Explain was less likely given Fuss (z = –0.63)
and Assure less likely given Calm (z = –1.89). Note, assuming the normal
approximation is justified, the first two – but not the last two – adjusted
residuals reached the 1.96 (absolute) criterion required to claim signifi-
cance at the .05 level (see Figure 9.2).
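For readers who want to verify such values, a small Python sketch follows. It uses the adjusted residual formula of Haberman (1978) – observed minus expected, divided by the square root of expected × (1 – row proportion) × (1 – column proportion) – applied here, for brevity, to the 2×2 Fuss-Assure table of Figure 9.6 rather than the full 3×3 table (its residual, 2.58 in absolute value, matches the z reported above).

    import numpy as np

    def adjusted_residuals(x):
        # Haberman (1978): (x - e) / sqrt(e * (1 - row prop) * (1 - col prop))
        x = np.asarray(x, dtype=float)
        n = x.sum()
        r = x.sum(axis=1, keepdims=True)              # row sums
        c = x.sum(axis=0, keepdims=True)              # column sums
        e = r * c / n                                 # expected joint frequencies
        return (x - e) / np.sqrt(e * (1 - r / n) * (1 - c / n))

    # Fuss-Assure 2x2 table from Figure 9.6
    print(adjusted_residuals([[10, 15], [4, 31]]))    # +/- 2.58 in every cell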
Adjusted residuals have limitations. One limitation is that the distri-
bution of adjusted residuals only approximates the standard normal dis-
tribution. The larger the row sum (xr+) and the less extreme the expected
probability (erc/xr+), the better the approximation. A reasonable guideline
is to assume that adjusted residuals are approximately normally distributed
only when the row sum is at least 30, and then only when the expected
probability is > .1 and < .9 (Haberman, 1979). Even when these guidelines
are met, a second limitation involves type I error; a single table may contain
several adjusted residuals, each of which is tested for significance. Because
comparing each to a 1.96 absolute, p < .05, criterion invites type I error,
some investigators may prefer a more stringent criterion, for example, 2.58
absolute, p < .01, or even an arbitrary criterion like 3 absolute (Bakeman &
Quera, 2012). Another possibility is a winnowing strategy that identifies the
most consequential adjusted residuals (see “Deviant cells, type I error, and
winnowing” in Chapter 10).
One final cell statistic, not included in Figure 9.1 and seldom encoun-
tered, is the standardized residual – defined as the raw residual divided by
the square root of the expected frequency, (xrc – erc) ÷ √erc. However, the
adjusted residual as defined in Figure 9.1 offers a better approximation to
a normal distribution and, for that reason, is preferable (Haberman, 1978).
It would make sense to call it standardized, but when the adjusted residual
was defined, the term standardized residual was already in use with the def-
inition just given (Haberman, 1979) – and so the better approximation is
known as the adjusted residual.
χ²      Pearson chi-square = Σr Σc (xrc – erc)² / erc, summed over r = 1...R and c = 1...C

G²      Likelihood-ratio chi-square = 2 Σr Σc xrc (ln xrc – ln erc), summed over r = 1...R and c = 1...C

For a 2×2 table with cells labeled

              Target:  Yes   No
  Given:  Yes           a     b
          No            c     d

OR      Odds ratio = (a / b) ÷ (c / d) = ad / bc

lnOR    Log odds ratio = ln(odds ratio)

Q       Yule’s Q = (ad – bc) ÷ (ad + bc)
Figure 9.5. Notation and definitions for three basic 2×2 contingency indices.
            Assure     &   Total                Assure     &   Total
  Fuss          10    15      25     Cry             3    16      19
  &              4    31      35     &              11    30      41
  Total         14    46      60     Total          14    46      60

Figure 9.6. Two 2×2 tables for the Figure 8.1 data with their associated odds ratios (95 percent CIs for the ORs are given in parentheses), log odds, and Yule’s Qs.
in two different ways. For the left-hand table, the odds of Assure to any code
not Assure (&) when Fuss was coded were 10 to 15, but the odds were 4 to
31 when Fuss was not coded. For this example, the odds ratio is 5.17 (10
÷ 15 divided by 4 ÷ 31). Concretely, this means that the odds of the nurse
offering assurance when the child was fussing were more than five times
greater than when the child was not fussing. Moreover, because the 95 per-
cent confidence interval (CI) does not include 1 – which is the no-effect
value – this result is statistically significant, p < .05. In contrast, assurance
was about half as likely when the child was crying than when not, but the
association was not statistically significant (see Figure 9.6, right).
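A short sketch of these computations, including the log odds and Yule's Q discussed next (the Wald confidence-interval method shown is a common textbook choice; GSEQ's method may differ):

    import math

    def contingency_indices(a, b, c, d):
        # a, b = given-yes row; c, d = given-no row of a 2x2 table
        odds_ratio = (a * d) / (b * c)
        ln_or = math.log(odds_ratio)
        se = math.sqrt(1/a + 1/b + 1/c + 1/d)       # SE of the log odds (Wald)
        ci = (math.exp(ln_or - 1.96 * se), math.exp(ln_or + 1.96 * se))
        yules_q = (a * d - b * c) / (a * d + b * c)
        return odds_ratio, ci, ln_or, yules_q

    # Fuss-Assure table: OR = 5.17, CI roughly (1.4, 19.2), lnOR = 1.64, Q = .68
    print(contingency_indices(10, 15, 4, 31))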
The log odds is the natural logarithm of the odds ratio. It varies from
negative infinity to positive infinity, with zero indicating no effect.
Compared to the odds ratio, its distributions are less likely to be skewed.
(Note, as with all scores, skew should be checked before analysis; if scores
are skewed, nonparametric analyses should be considered or scores should
be recoded or transformed before parametric analyses). The log odds is
expressed in difficult-to-interpret logarithmic units – which can be a limi-
tation. For example, for the Fuss-Assure association, the natural logarithm
of the odds ratio is log_e 5.17 = 1.64 (i.e., 2.718...^1.64 = 5.17), which is diffi-
cult to interpret concretely. Thus the odds ratio is better descriptively, but
the log odds is often the better choice when using standard parametric
statistical techniques such as correlation, multiple regression, and analysis
of variance.
Investigators sometimes ask whether an individual odds ratio is statis-
tically significant – meaning significantly different from 1 in a sample of the
size used for computation. We are of three minds on this. First, we think –
as do others (e.g., Wilkinson and the Task Force on Statistical Inference,
1999) – that statistical significance is often overrated and overemphasized,
and that equal emphasis on effect size is desirable. Second, it is nonetheless
useful to compute and report 95 percent CIs for odds ratios (as GSEQ and
most statistical programs do); if 1 is not included in the CI, the odds ratios
differ from 1, p < .05. Third, guidelines – understood with the appropriate
grain of salt – can be useful (e.g., Cohen’s, 1977, suggestion that Pearson
product-moment correlations of .10, .30, and .50 represent small, medium,
and large effect sizes).
With regard to odds ratios, a general guideline suggested by Haddock,
Rindskopf, and Shadish (1998) is that odds ratios close to 1.0 indicate
weak relationships, whereas odds ratios over 3.0 for positive associations
or less than 0.33 for negative associations indicate strong relationships.
Additionally, we think that odds ratios between 1.25 and 2.00 (or 0.50–
0.80) should be regarded as weak, and those between 2.00 and 3.00 (or
0.33–0.50) should be regarded as moderate. Our rationale is that increas-
ing the odds 100 percent, which is what an odds ratio of 2.00 does, is
a reasonable definition for moderate (Parrott, Gallagher, Vincent, &
Bakeman, 2010). Moreover, our cut points for the odds ratio correspond
to values of .11, .33, and .50 absolute for Yule’s Q, an index of association
for 2×2 tables that ranges from –1 to +1 and is discussed next. Note that
these cut points for Yule’s Q are almost the same as Cohen’s for r (1977,
see previous paragraph).
Yule’s Q
Yule’s Q is an index of effect size (see Figure 9.5). A straightforward alge-
braic transformation of the odds ratio (see Bakeman, 2000), it is like the
familiar correlation coefficient in two ways – it varies from –1 to +1 with 0
indicating no effect, and its units have no natural meaning. Consequently,
its interpretation is not as concrete or clearly descriptive as the odds ratio.
On the other hand, compared to the odds ratio, its distributions are less
likely to be skewed, and so it can be used both descriptively and analyt-
ically (assuming distributions are not badly skewed). It is also somewhat
less vulnerable to a zero cell count than the odds ratio, as described in the
next section.
One final 2×2 cell statistic, not included in Figure 9.1, but often listed
in older texts, is the phi coefficient (Hays, 1963; see also Bakeman, 2000). It
is a Pearson product-moment correlation coefficient computed with bin-
ary data. Like Yule’s Q, it can potentially assume values from –1 to +1, but
can only achieve its maximum value when pr = pc = .5; thus Yule’s Q almost
always seems preferable.
think that the value for a contingency index should be treated as missing
if any row or column sum is less than 5 (this is the default value supplied
by GSEQ), but some investigators may prefer a more stringent guideline.
summary
The summary statistics described in the previous chapter were computed
for individual codes. In contrast, the statistics described in this chapter are
derived from two-dimensional contingency tables whose rows and columns
are defined with two or more codes. These statistics are of three kinds. The
first kind are primarily descriptive statistics for individual cells, the second
kind are summary indices of association for tables of varying dimensions,
and the third kind – and most important for sequential analyses – are sum-
mary statistics specifically for 2×2 tables. By convention, we call the row
codes the givens and the column codes the targets.
Individual cell statistics include observed joint frequencies, joint fre-
quencies expected by chance (i.e., assuming no association between row
and column codes), conditional probabilities, raw residuals (differences
between observed and expected frequencies), and adjusted residuals.
Adjusted residuals are especially useful because they allow comparisons of
given-target code pairs within a particular table; granted assumptions and
sufficient counts, they are distributed approximately normally.
The observed cell frequencies for timed-event, interval, and multicode
event data are the tallies that result from cross-classification. Each of the
session’s units is considered in turn and, depending on how it is coded, a
tally is added to one of the cells of the contingency table. The order in which
the row and column codes of the R×C table are listed matters. All units
must add one, and only one, tally to the table so that the total number of
tallies equals the number of units cross-classified. Tallying follows a hier-
archical rule: If a unit contains more than one given (or target) code, only
the one encountered first in the list is used to cross-classify the unit. Usually
columns are not lagged relative to rows (Lag 0).
In contrast, for single-code event data, events cannot co-occur. Lag 0
would result in a table with structural zeros in off-diagonal cells and code
frequencies on the diagonal, so typically columns are lagged relative to
rows. When codes cannot repeat, zeros on the diagonal of a Lag 1 table are
structural, and so expected frequencies need to be computed with an itera-
tive proportional fitting algorithm instead of the usual formula.
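The iterative proportional fitting idea is simple enough to sketch in a few lines of Python (a bare-bones version; it assumes every row and column retains at least one admissible cell, and the example table is hypothetical):

    import numpy as np

    def ipf_expected(x, structural_zeros, iters=200):
        # Expected frequencies under quasi-independence: alternately scale
        # rows and columns to the observed sums, holding structural zeros at 0.
        x = np.asarray(x, dtype=float)
        e = (~structural_zeros).astype(float)     # 1 in admissible cells, 0 elsewhere
        for _ in range(iters):
            e *= x.sum(axis=1, keepdims=True) / e.sum(axis=1, keepdims=True)
            e *= x.sum(axis=0, keepdims=True) / e.sum(axis=0, keepdims=True)
        return e

    # hypothetical Lag 1 table for three codes that cannot repeat (structural diagonal)
    x = np.array([[0, 8, 4], [6, 0, 5], [5, 4, 0]])
    print(ipf_expected(x, np.eye(3, dtype=bool)))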
Indices for two-dimensional tables include the well-known χ² (Pearson
chi-square); the similar G² (likelihood-ratio chi-square), which is used in
10
Preparing for Sequential and Other Analyses
It makes sense to define codes and use recording procedures that work best
for the observers. After all, if we expect good work, we should accommodate
observers’ comfort; the data as recorded can be modified later into forms
that facilitate analysis. In Chapter 4, we argued the utility of representing
observational data with a few standard formats (i.e., single-code event,
timed-event, interval, and multicode event data) and then conceptualizing
such data as a code-unit grid – partly for the order and organization doing
so brings to observational data, but also for the opportunities it presents for
later data modification. In this chapter we describe, among other matters,
data modification, that is, specific ways new codes can be created from exist-
ing codes€– new codes that are faithful to and accurately reflect our research
questions and that extend the range of our data-analytic efforts.
Given the benefits, it is a bit puzzling that data modification of obser-
vational data is not more common. Perhaps it is because data modifica-
tion occupies something of a middle ground. On the one hand, there are
a number of systems for computer-assisted coding that facilitate the initial
recording of observational data; most produce the sorts of summary scores
described in Chapter 8. On the other hand, there are a number of statistical
packages that permit often quite complex data recoding and transformation
of summary scores. But no coding or analysis programs we know of address
the need to modify sequential data before summary scores are computed.
In this respect, GSEQ, with its extensive and flexible data modification cap-
abilities, may be uniquely helpful.
new codes from existing ones; they add these new codes to the MDS file but
leave the initial SDS file unchanged.
In this section we define the data modification commands that are imple-
mented in GSEQ. Several depend on standard logical operations; a few are
simply housekeeping; several work only with timed-event, interval, and
multicode event data; and a few others work only with single-code event
data. In GSEQ, after making a series of data modifications (including the
WINDOW command described in the next section), you have the option to
overwrite the existing MDS file or to create a new MDS file. If you choose to
create a new file, it contains the modifications you just made and the earlier
MDS file remains intact. The new codes created by the data modification
commands can be used in any subsequent analyses (summary statistics,
contingency tables, etc.).
[Figure 10.1 grid: a 30-second (or 30-interval) track for the existing codes Fuss, Explain, and Touch, followed by the new codes produced by the And, Or, Not, Nor, Xor, and Recode commands; the grid’s alignment does not survive extraction.]
Figure 10.1. Use of logical combinations and the RECODE command to create new codes from existing ones, assuming 1-second precision for timed-event data or 1-second intervals for interval sequential data; results would be the same for multicode event data. Each of the six commands at the bottom specified the three codes at the top. A check mark indicates the onset of an event (or episode) and a plus sign its continuation.
NOT is the complement of AND: The new code is checked for any time unit
(or interval or multicoded event) that is not coded for all of the existing
codes specified. NOR is the com-
plement of OR: The new code is checked for any time unit (or interval or
multicoded event) that is not already coded for one or more of the existing
codes specified. Finally, XOR is the exclusive OR: The new code is checked
for any time unit (or interval or multicoded event) that is already coded for
one, and only one, of the existing codes specified.
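Representing each code as the set of units it occupies makes these operations one-liners; a sketch with hypothetical unit sets:

    units = set(range(1, 31))                # a 30-unit session, as in Figure 10.1
    fuss, explain, touch = {1, 2, 3}, {3, 4, 5, 6}, {6, 7}   # hypothetical occupancy

    AND = fuss & explain & touch             # units coded for all three
    OR = fuss | explain | touch              # units coded for at least one
    NOT = units - AND                        # complement of AND
    NOR = units - OR                         # complement of OR
    XOR = {u for u in units
           if (u in fuss) + (u in explain) + (u in touch) == 1}   # exactly one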
3 with OR, but 5 with RECODE – the number of existing events com-
bined to create the new code. When combining existing events into a new
code, use OR if you think of the merged event as becoming a single occur-
rence, but use RECODE if you think of the merged events as remaining
separate but contiguous events. RECODE differs from OR in another way
as well. Existing codes used to define OR remain in the GSEQ MDS file,
whereas existing codes used to define RECODE are deleted from the MDS
file and so are unavailable for subsequent analyses; the initial SDS file is
unchanged.
Figure 10.2. Resulting sequences when applying the RECODE and LUMP data
modification commands to the single-code event sequence shown and applying
CHAIN to the sequence resulting from the LUMP command.
and are specified in hundredths, you would enter 100 to indicate that you
wanted gaps of 1 second filled in.
codes could be named in ways that are more mnemonic or consistent or, after
creating a new code, you decide a different name would work better.
[Figure 10.3 grid: over a 20-second track, the existing code Cry occupies seconds 8–15, and each WINDOW specification – for example, (Cry, Cry), Cry-3, Cry+3, and (Cry-4,Cry)+4 – is shown with the new windowed code it produces; the grid’s alignment does not survive extraction.]
Figure 10.3. Existing data and WINDOW command specifications for new codes anchored to onsets and offsets of the existing code. The names for the new codes are not shown. A check mark indicates the onset of an event and a plus sign its continuation.
probabilities are less than .05, we would ask instead which are less than
.0056 (.05/9). However, as Cohen (1990) has noted, in practice, the prob-
ability of type I error is almost always zero because effect sizes, even when
small, are almost never exactly zero. He argued that researchers are unlikely
to solve our “multiple tests problem with the Bonferroni maneuver” (p.
1304); for one thing, applied zealously, almost nothing is ever significant
(see also Wilkinson and the Task Force on Statistical Inference, 1999). Both
Cohen and Wilkinson recommend that investigators interpret overall pat-
terns of significant effects, not just individual ones; that they be guided by
predictions ordered from most to least important; and, above all, that they
focus on effect sizes. This advice has considerable merit.
Moreover, probabilities for individual tests are always approxima-
tions to some degree. This is certainly true for the present example of
nine adjusted residuals, none of which meet Haberman’s criteria for a
good approximation (no row sums are 30 or greater; see “Expected fre-
quencies and adjusted residuals” in Chapter 9). It is also true that none of
the approximate probabilities are less than the Bonferroni value of .0056
either – illustrating Cohen’s point. Nonetheless, something appears pat-
terned about the joint frequencies shown in Figure 9.2; still, noting
only which adjusted residuals are large is too piecemeal an approach. A
principled way of examining table cells, not individually piece by piece but
as a whole, would be welcome.
The counts in a contingency table form something of an interconnected
web – as values in one cell change, expected frequencies in all cells are
affected. For this reason, a whole-table approach to identifying deviant cells
makes considerable sense, especially as the number of cells becomes large.
We call our whole-table approach winnowing (Bakeman & Gottman, 1997;
Bakeman & Quera, 1995b; Bakeman, Robinson, & Quera, 1996; see also
Fagen & Mankovich, 1980, and Rechten & Fernald, 1978), but to explain
it requires introducing a couple of log-linear analysis concepts that we
develop further in the next chapter.
Winnowing is based on the familiar chi-square test of independence
(see “Indices of association for two-dimensional tables” in Chapter 9). For
this test, expected values are computed based on a model of row and col-
umn independence (as reflected in the usual erc = pc × xr+ formula). The
chi-square (using either χ² or G²) is a goodness-of-fit test; smaller values,
which indicate less discrepancy between observed and expected frequen-
cies, indicate better fit. Substantively, we usually want to show that the row
and column factors are associated; that is, usually we want fit to fail and
so desire large values of chi-square – ones with p values less than .05. In
contrast, to indicate fit, we want small values – ones with p values greater
than .05 (i.e., values of chi-square less than the critical .05 value for the
appropriate degrees of freedom).
When some adjusted residuals are large, the omnibus chi-square for the
table will be large as well – fit will fail. Winnowing attempts to determine
which cells are causing table fit to fail and how few cells we can ignore (i.e.,
replace with structural zeros) before achieving a model of independence
that fits (technically, a model of quasi-independence because it includes
structural zeros; see Wickens, 1989, pp. 251–253). Almost always, a fitting
model is achieved before all cells with large adjusted residuals are replaced,
meaning that interpretation can then focus on fewer cells – adding any one
of them back would cause fit to fail significantly. If this process of removing
outlandish cells seems counterintuitive, think of it this way: To determine
who is responsible for a too-noisy room, we remove the loudest person first,
the next loudest second, and so forth, until reasonable quiet reigns.
Winnowing is an iterative process. We delete cells (i.e., replace them
with structural zeros) one by one until we find a model of quasi-indepen-
dence that fits. Winnowing can proceed either theoretically (delete cells in
a prespecified order until a fitting model is found) or empirically (at each
step, remove the cell with the largest absolute residual until a fitting model
is found). An alternative empirical method is to order cells based on the
absolute magnitude of the adjusted residuals from the initial model and
then to delete them in that order. Both empirical approaches are illustrated
in Figure 10.4.
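A bare-bones empirical winnowing loop can be sketched in Python (here deletions are ordered by raw residuals – adjusted residuals are the alternative noted above – expecteds are refit by iterative proportional fitting, and the demonstration table is hypothetical; GSEQ's companion ILOG program is the more careful implementation):

    import numpy as np
    from scipy.stats import chi2

    def quasi_expected(x, zeros, iters=200):
        # IPF: fit row and column sums, holding structural-zero cells at 0
        e = (~zeros).astype(float)
        for _ in range(iters):
            e *= x.sum(axis=1, keepdims=True) / e.sum(axis=1, keepdims=True)
            e *= x.sum(axis=0, keepdims=True) / e.sum(axis=0, keepdims=True)
        return e

    def winnow(x, alpha=0.05):
        # Replace cells with structural zeros, one by one, until the model
        # of quasi-independence fits (G2 with p > alpha).
        x = np.asarray(x, dtype=float)
        zeros = np.zeros(x.shape, dtype=bool)
        deleted = []
        while True:
            xz = np.where(zeros, 0.0, x)
            e = quasi_expected(xz, zeros)
            m = xz > 0
            g2 = 2 * (xz[m] * (np.log(xz[m]) - np.log(e[m]))).sum()
            df = (x.shape[0] - 1) * (x.shape[1] - 1) - len(deleted)
            if chi2.sf(g2, df) > alpha:
                return deleted, g2, df
            resid = np.where(zeros, -np.inf, np.abs(xz - e))
            cell = np.unravel_index(np.argmax(resid), x.shape)
            zeros[cell] = True                # delete the most deviant cell next
            deleted.append(cell)

    print(winnow(np.array([[1, 10, 5], [3, 6, 10], [10, 5, 10]])))  # hypothetical table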
The adjusted residuals for the model of independence – [R][C] for rows
and columns using conventional log-linear notation – are shown as Model
#1 at the top of Figure 10.4. With G²(4) = 13.4 and p = .01, this model fails to
fit. If we took an empirical approach, we would first delete the Fuss-Assure
cell. The resulting Model #2, with G²(3) = 6.7 and p = .09, fits. The Fuss-
Assure cell was responsible for failure to fit. Models #1 and #2 are related
hierarchically; the difference between their chi-squares is 6.7 (13.4 – 6.7)
and is distributed as chi-square with 1 degree of freedom (4 – 3 = 1, the
difference between their respective dfs). The value of 6.7 exceeds 3.84, the
.05 critical value for 1 df, thus adding the Fuss-Assure cell back to the model
causes goodness-of-fit to deteriorate significantly.
If Model #2 had not fit, again proceeding empirically, next we would
have deleted the Calm-Explain cell (the largest absolute adjusted residual
for #1) or the Cry-Explain cell (the largest absolute adjusted residual for #2);
see Models #3 and #4, respectively. Either would have resulted in a model
that fits (although #4 has smaller adjusted residuals overall); and, although
Preparing for Sequential and Other Analyses 131
[Figure 10.4 table: for each of four models, its terms, (df, N), G², p, and the adjusted residuals for the Assure, Explain, and & columns; the table body does not survive extraction.]
Figure 10.4. Table-fit statistics and adjusted residuals for four models illustrating winnowing; dashes indicate structural zeros.
both began with a fitting model (#2), both resulted in a significant increase
in goodness-of-fit (G²[1] for #2 – #3 = 6.7 – 2.7 = 4.0; and G²[1] for #2 – #4
= 6.7 – 0.67 = 6.0; both G²s > 3.84).
If we took a conceptual approach, and had theoretical reasons for want-
ing to delete first Calm-Explain and then Fuss-Assure, we would have dis-
covered that deleting the Calm-Explain cell resulted in a fitting model
(G²[3] = 7.0, p = .07), albeit marginally, but that deleting the Fuss-Assure
cell next resulted in a model that fit significantly better (Model #3 in Figure
10.4, hierarchical test G²[1] = 4.3, p < .05). (Winnowing is implemented in
our ILOG 3 program; see our GSEQ web pages for download).
summary
Data modification – mainly creating new codes from existing codes – offers
a flexible and powerful way to create variables that are particularly faithful
to our research questions; thus, it is a bit surprising that data modification
does not receive more attention. Data modifications are of several kinds.
Among the most useful are such logical combinations as AND and OR (for
timed-event, interval, and multicode event data), which let you define a
residual applied piecemeal to all cells courts type I error, yet a Bonferroni
correction may be too stringent. We recommend a log-linear-based
approach that we call winnowing; cells are replaced with structural zeros,
one by one, until a fitting model (of quasi-independence) is found, thus
identifying those “deviant” cells that caused fit to fail.
11
Time-Window and Log-Linear Sequential Analysis
The phrase sequential analysis – which appears in the title of this book
as well as earlier ones (Bakeman & Gottman, 1997; Bakeman & Quera,
1995a) – admits to more than one meaning. In the context of microbiology,
it can refer to the description of genetic material. In the context of statistics,
it can mean sequential hypothesis testing – that is, evaluating data as they
are collected and terminating the study in accordance with a predefined
stopping rule once significant results are obtained (Siegmund, 1985).
However, in the context of observational methods generally, and in the
context of this book specifically, sequential analysis refers to attempts to
detect patterns and temporal associations among behaviors within observa-
tional sessions. As such, sequential analysis is more a toolbox of techniques
than one particular technique. It can include any of a variety of techniques
that serve its goals. Some of these techniques have already been discussed
(e.g., “Contingency indices for 2×2 tables” in Chapter 9). The unifying fac-
tor is the data used; by definition, sequential analysis is based on sequen-
tial data – data for which some sort of continuity between data points can
be assumed. Indeed, a common thread throughout this book has been the
description and use of the four basic sequential data types we defined in
Chapter 4 – single-code event, timed-event, interval, and multicode event
sequential data.
As we have emphasized in earlier chapters, such sequential data result
from coding or rating – that is, from nominal and occasionally ordinal
measurement. There is another kind of sequential data that does not appear
in these pages – not because we think it is unimportant, but because its
collection and analysis requires quite different approaches from those
described here. It is usually called time series data, is common especially in
such fields as economics and astronomy, and is characterized by a lengthy
series of numbers (often in the 100s and 1,000s) usually measured on a ratio
[Figure 11.1 table: mean odds ratios for males and females; the table body does not survive extraction.]
Figure 11.1. Scores are mean odds ratios, n = 16 for males and 14 for females; analyses were conducted with log odds. For each row, means sharing a common subscript do not differ significantly, p < .05, per Tukey post-hoc test.
generates probabilities for all possible outcomes using the binomial distri-
bution and so can determine exactly where in this distribution the observed
value lies (for details, see Bakeman & Robinson, 2005). Although the sign
test can analyze binary outcomes generally, in the following two paragraphs
we present an example showing how the sign test is useful specifically in the
context of sequential analysis.
Consider the Deckner et al. (2003) study described a few paragraphs
earlier. The group means and the results of the parametric analysis pre-
sented in Figure 11.1 answer questions about mean differences, but leave
other questions unanswered. For example, when the infants were 18 months
of age, how many tended to match their mothers? In other words, for how
many were the odds ratio over 1? We know that the means for both males
and females were less than 1, but the mean both summarizes and obscures
how individuals performed. In contrast, the sign test – which requires that
we count individual cases – makes it easy to highlight how individuals per-
formed, allowing us to report, for example, the percentage of individuals
with a particular outcome.
For the current example, odds ratios exceeded 1 for only 6 of the
30 18-month-olds (4 of 16 males and 2 of 14 females), which is signifi-
cant, p < .01 with either a one- or two-tailed sign test (separately by sex,
p < .05 one-tailed for males, p < .05 two-tailed or p < .01 one-tailed for
females). A similar analysis indicates that odds ratios exceeded 1 for 7
of 16 30-month-old males and for 12 of 14 30-month-old females. The
effect for 30-month-olds overall and for 30-month-old males was not
significant, but the effect for 30-month-old females was (p < .05 two-
tailed or p < .01 one-tailed). As this example shows, the sign test not only
is useful analytically when evaluating contingency indices, but also pro-
vides a level of individually based descriptive detail that gets lost when
only group means are presented.
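In Python, such sign tests are one-liners (scipy's binomtest; the counts are those of the example above):

    from scipy.stats import binomtest

    # 6 of 30 18-month-olds had odds ratios above 1
    print(binomtest(6, n=30, p=0.5).pvalue)                        # two-tailed, ~.0014
    print(binomtest(6, n=30, p=0.5, alternative='less').pvalue)    # one-tailed, ~.0007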
significance can seem a bit piecemeal. Log-linear analysis offers a more stat-
istically grounded, whole-table approach. Among the standard references
are Bishop, Fienberg, and Holland (1975), Fienberg (1980), and Haberman
(1978, 1979), although more accessible alternatives are Bakeman and
Robinson (1994), Kennedy (1992), and Wickens (1989).
Log-linear analysis can be regarded as an extension of the traditional
2-dimensional chi-square test of independence or association. And while
traditional chi-square analysis is limited to contingency tables of just two
dimensions, log-linear analysis can handle tables of more dimensions – and
so can handle chains longer than just two events.
However sampled, each chain considered adds a tally to the m-way table.
For example, assume an interest in 3-event chains and three codes – Assr,
Exp, and Cajole for assure, explain, and cajole – that might be applied to
parents’ or medical professionals’ turns of talk. These codes can repeat, and
thus a 3-event chain could add a tally to any of the 27 cells of the 3×3×3,
Lag 0 × Lag 1 × Lag 2 contingency table. Specifically, an Assr-Exp-Exp chain
would add one tally to the middle cell in the table at the top left in Figure
11.2. However, when codes cannot repeat, some of the cells will be struc-
turally zero. For example, again assume an interest in 3-event chains, but a
different set of three codes – Alert, Fuss, and Cry – that might be applied
to a child’s state and that cannot repeat. Instead of 27 possible 3-event
sequences, now there are only 12 – with the pattern of structural zeros
shown in Figure 11.2 (right). At first glance, the large number of structural
zeros might seem problematic, but in fact they are handled routinely by log-
linear analysis – which is one strength of this approach when attempting to
detect chains in single-code event data.
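Tallying chains into such a table is straightforward; a sketch with an illustrative sequence (here successive chains overlap, one of several sampling choices):

    import numpy as np

    codes = ['Assr', 'Exp', 'Cajole']
    index = {code: k for k, code in enumerate(codes)}
    seq = ['Assr', 'Exp', 'Exp', 'Cajole', 'Assr', 'Exp']   # illustrative turns of talk

    table = np.zeros((3, 3, 3))                 # Lag 0 x Lag 1 x Lag 2
    for i in range(len(seq) - 2):               # each (overlapping) 3-event chain
        table[index[seq[i]], index[seq[i + 1]], index[seq[i + 2]]] += 1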
Model Terms         G²      df    Deleted Term    ΔG²      Δdf
[012]                0       0    —               —        —
[01][02][12]         7.7     8    [012]           7.7      8
[01][12]            11.5    12    [02]            3.9      4
[0][12]             47.0**  16    [01]           35.5**    4
[0][1][2]           81.3**  20    [12]           34.4**    4

Figure 11.3. Log-linear analysis of the three-dimensional table shown on the left in Figure 11.2 (codes can repeat).
** p < .01
the association between Lag 1 and Lag 2 is moderated by (i.e., depends on)
what the Lag 0 term is.
For the counts given in Figure 11.2 (left), the generated counts for the
[01][02][12] model fit tolerably; the G² of 7.7 with 8 df is small and not
statistically significant (see Figure 11.3). In other words, when we deleted
the saturated term, the G² indicating model fit deteriorated from 0 to 7.7;
that is, ΔG² (the change in G²) was 7.7, which with 8 degrees of freedom is
not a significant change in fit.
Now we have a choice: Which two-way term should be deleted next? As
you would expect, this could be decided statistically – delete first the term
that causes the smallest change in G² – or conceptually. For conceptual rea-
sons, we decided to delete the [02] term. A model that included the [01] but
not the [12] term – or vice versa – would make little sense in a sequential
context; if a Lag 1 association exists, both terms would be necessary. The
G² for this model – [01][12] – generated expected frequencies that still fit
reasonably (G² = 11.5, df = 12) – and ΔG² = 3.9, which with 4 df was not
a significant change. If we had proceeded next to delete the [01] term, the
resulting model – [0][12] – would not fit, the ΔG² of 35.5 with 4 df would
indicate a significant deterioration in fit, and moreover, as just noted, this
model would make little sense in a sequential context with the [01] but not
the [12] term. Thus we would select the [01][12] model to interpret.
The [01][12] model implies that when events at Lag 1 are taken into
account, events at Lag 0 and Lag 2 are not associated, but are in fact inde-
pendent. The ability to detect such conditional independence – the independ-
ence of Lag 0 and Lag 2 conditional on Lag 1 – is a strength of the log-linear
approach applied to single-code event sequential data; it is a strength not
shared with more piecemeal approaches. (Conditional independence is
Model Terms         G²      df    Deleted Term    ΔG²     Δdf
[01][02][12]         0       0    —               —       —
[01][12] – CFC       1.6     2    [02] – CFC      1.6     2
[01][12]            10.8*    3    [02]            9.2*    1

Figure 11.4. Log-linear analysis of the three-dimensional table shown on the right in Figure 11.2 (codes cannot repeat).
* p < .05
symbolized 0⊥2|1 by Wickens, 1989; see also Bakeman & Quera, 1995b.)
In other words, we conclude that two-event chain patterning characterizes
this particular sequence. Knowing that chains are not patterned at Lag 2,
we could then proceed to a Lag 0-Lag 1 follow-up analysis, using winnow-
ing. For these data, we would discover that when the Assure-Cajole and
the Cajole-Assure cells are deleted, the resulting model of quasi-indepen-
dence fits: G²(2, N = 149) = 0.02, p = .98. Apparently for these (generated,
not empirically collected) data, patterning consisted of Assure-Cajole and
Cajole-Assure transitions.
For our second example, consider the three-dimensional table on the
right in Figure 11.2, for which codes cannot repeat. This example illustrates
an additional advantage of the log-linear approach – its ability to deal with
sequences when codes cannot repeat and their attendant structural zeros.
When the number of codes is three, the [01][02][12] model is completely
determined – that is, both G² and df are 0; in effect, it is the saturated model.
Removing the [02] term causes fit to fail: G² = 10.8 for the resulting [01]
[12] model, which with 3 df is significant (i.e., significantly bad fit), and the
change in fit (ΔG² = 10.8, df = 3) was likewise significant (compare first and
last line in Figure 11.4). We tentatively accept the [01][02][12] saturated
model. Unlike the previous analysis, in this case, the model of conditional
independence fails to fit and so we conclude that 3-event chain patterning
characterizes this particular sequence.
However, these data can be subject to winnowing (again, see “Deviant
cells, type I error, and winnowing” in Chapter 10). Imagine, for example,
that we have theoretical reasons to think that the Cry-Fuss-Cry chain is of
particular importance. To test its importance, we replace its count of 15
(see lower-right table in Figure 11.2) with a structural zero. As shown in
Figure 11.4 (middle line), the model of conditional independence with the
Cry-Fuss-Cry cell removed – [01][12] – CFC – fits the data (G² = 1.6, df = 2).
But the [01][12] model with the Cry-Fuss-Cry cell replaced (bottom line)
fails to fit. The change in G² from the [01][12] model with the structural
zero to one without the structural zero is significant. We conclude that the
Cry-Fuss-Cry chain causes the failure of the [01][12] model to fit. Note,
however, that we chose the Cry-Fuss-Cry chain for theoretical reasons;
replacing other chains with structural zeros might also have resulted in a
[01][12] model that fit – which simply underlines the importance of con-
ceptually guided data analysis.
The illustration of log-linear methods applied to exploring sequencing
in single-code event data presented in the previous several paragraphs
should be regarded as brief and introductory and in no way exhaustive.
If these techniques seem applicable to your work, we encourage you to
read further (e.g., Bakeman & Gottman, 1997; Bakeman & Quera, 1995b;
Wickens,€1989).
Model Terms              G²      df    Deleted Term    ΔG²     Δdf
[ADPR]                    0       0    —               —       —
[ADP][AR][DR][PR]        8.5      4    [ADPR]          8.5     4
[ADP][DR][PR]            8.6      5    [AR]            0.1     1
[ADP][DR]                9.0      6    [PR]            0.4     1
[ADP][R]                18.2**    7    [DR]            9.2**   1

Figure 11.6. Log-linear analysis of the four-dimensional table for the data given in Figure 11.5.
** p < .01
summary
Sequential analysis refers to attempts to detect patterns and temporal asso-
ciations among behaviors within observational sessions. As such, sequen-
tial analysis is more a toolbox of techniques than one particular technique.
Contingency tables are given particular attention. Time-based contingency
tables allow us to examine patterns of co-occurrence in timed-event data,
while lagged event-based contingency tables allow us to examine sequen-
tial patterns in single-code event data. Some sequential analytic approaches
and techniques have already been described in previous chapters.
Time-window sequential analysis offers a way to examine lagged asso-
ciations in timed-event data. You define a window of opportunity keyed to
a given behavior (e.g., the five seconds after a given behavior begins) and
then tally how often a target begins within such a window. The association
of a given window and a target onset can be summarized with a statistic
such as the odds ratio or Yule’s Q, computed for each session separately and
analyzed as appropriate.
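A sketch of the tallying for one session (per-second boolean tracks; the 5-second window and onset-based tallying shown are one reasonable set of choices among several):

    def onsets(track):
        # indices where a code turns on
        return [t for t in range(len(track)) if track[t] and (t == 0 or not track[t - 1])]

    def time_window_odds_ratio(given, target, width=5):
        # given, target: per-second True/False tracks of equal length
        n = len(given)
        window = [False] * n
        for t in onsets(given):                 # window = width seconds from each onset
            for u in range(t, min(t + width, n)):
                window[u] = True
        tgt = set(onsets(target))
        a = sum(1 for t in range(n) if window[t] and t in tgt)        # window & onset
        b = sum(1 for t in range(n) if window[t] and t not in tgt)
        c = sum(1 for t in range(n) if not window[t] and t in tgt)
        d = n - a - b - c
        return (a * d) / (b * c)                # odds ratio (assumes no zero cells)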
The sign test (or binomial test) is a simple statistic for binary outcomes.
It makes few assumptions, provides useful descriptive detail, and lets
you determine whether, for example, a particular contingency occurred
12
Recurrence Analysis and Permutation Tests
recurrence analysis
In this section we consider techniques that rely on whole sequences to
display patterns graphically. Exploring a sequence as a whole can pro-
vide new insight into any patterns that may exist, where in the sequence
they occur, and even whether they tend to repeat in different, but com-
parable, sequences. Moreover, such explorations can also be applied to two
sequences to determine whether certain codes tend to repeat in both, thus
revealing a possible synchronicity.
Eckmann, Kamphorst, and Ruelle (1987) proposed using a kind of
similarity map – called a recurrence plot or dot plot – to detect patterns
structural changes in time series for quantitative variables that describe
the behavior of dynamic systems (e.g., weather, stock market). A recur-
rence plot is an array of dots arranged in an N×N square. Values for both
horizontal and vertical axes are associated with the successive values of a
time series of N elements. The color assigned to a dot (i.e., cell rc of the
N×N matrix, where r = 1...N, bottom-to-top, for the Y axis and c = 1...N,
left-to-right, for the X axis) depends on the similarity between the r-th
and c-th elements. Either a single color like black or different colors or
[Figure 12.1 plots: only axis ticks and labels (Time (sec); Adult utterance number) survive extraction.]
Figure 12.1. Examples of recurrence plots. At left, a recurrence plot of the ECG measurement of a heart beat (Marwan, 2003; retrieved from www.recurrence-plot.tk). At right, a cross-recurrence plot of mother and infant utterances at 10 months of age (from Buder et al., 2010). See text for details.
[Figure 12.2 plots: the event sequence we he we he we he we hc wp he we he labels both axes; the dot grids do not survive extraction.]
Figure 12.2. Two recurrence plots for a single-code event sequence of a couple’s verbal interaction. For both plots, the sequence is represented top to bottom and left to right. At left, each row (and each column) of the plot corresponds to one code and similarities are all-or-nothing. At right, each row (and each column) corresponds to a time window containing a chain of three successive codes; time windows are shifted one event forward and are overlapped; similarities are quantitative and represented with levels of grey.
may overlap or not, each providing slightly different plots. When time win-
dows are used, similarity is no longer all-or-nothing because quantitative
measures of similarity can be represented with different levels of gray (as
in Buder et al.’s, 2010, example). To illustrate, the verbal interaction data
shown at the bottom of Figure 12.2 creates a recurrence plot in which time
windows containing three codes were moved along the sequence (Figure
12.2, right). Successive windows were shifted one event forward and thus
overlapped; the first three windows started at the first, second, and third
events in the sequence – i.e., [we he we], [he we he], and [we he we]. Gray
dots indicate that certain windows have a nonperfect similarity – e.g., [ha
wa hp] and [ha wa ha].
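Generating a basic plot of this kind takes only a few lines of Python; in the sketch below similarity is first an all-or-nothing code match, then the proportion of matching positions within overlapping 3-event windows (both the sequence and the similarity definitions are illustrative choices):

    import numpy as np
    import matplotlib.pyplot as plt

    seq = list('ABCABCABDABC')          # illustrative single-code event sequence
    n = len(seq)

    # all-or-nothing similarity: dot where events r and c carry the same code
    R = np.array([[seq[r] == seq[c] for c in range(n)] for r in range(n)], dtype=float)

    # windowed similarity: proportion of matching codes in 3-event windows
    w = 3
    m = n - w + 1
    S = np.array([[sum(seq[r + k] == seq[c + k] for k in range(w)) / w
                   for c in range(m)] for r in range(m)])

    for mat in (R, S):
        plt.imshow(mat, cmap='Greys')   # darker = more similar
        plt.show()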
In this case, repetitions of similar three-code chains occurred in differ-
ent parts of the sequence as indicated by diagonal segments that are parallel
to the main diagonal; notice the run of five diagonal black dots close to the
main diagonal in the lower right quarter of the plot. They correspond to
five successive windows of the section wa ha wa hp wa ha wa hp wa ha wa
in the last quarter of the sequence, which starts at position 67 and ends at
Figure 12.3. Recurrence plots for a random event sequence (top) and a highly pat-
terned event sequence of verbal interactions (bottom). Left to right, plots indicate
time windows 1, 2, and 3 codes wide. See text for details.
Figure 12.4. At bottom, a timed-event sequence of a child’s crying and fussing epi-
sodes, and at top, its recurrence plot. Contents for two pairs of windows with high
similarities are shown. See text for details.
intersections along those horizontal and vertical bands indicate high simi-
larities – that is, similar repetitions of codes Cry, Calm, and Fuss at different
positions along the sequence.
A recurrence plot can be further processed to reveal the temporal struc-
ture of the sequence; for example, the RAP program can detect segments
in a sequence by filtering similarity values along and close to the diagonal.
Following Foote and Cooper (2003), a square matrix (smaller in size than
the plot itself) whose cells contain values from a two-dimensional normal
distribution (a Gaussian checkerboard filter) is moved along the diagonal,
centered on each of its dots; for every diagonal dot, its surrounding regions
are multiplied cell-wise by the filter and all the products are added, yielding
one single measure called a novelty score. Novelty scores are a time series
indicating where important changes in the sequence occur; by applying the
filter, abrupt changes in the sequence are highlighted. Figure 12.5 shows
a recurrence plot for an interval-recorded sequence of mother-infant inter-
action 558 intervals long; codes include approach, obey, instruct, complain,
and so on. The plot was generated by applying a moving time window two
intervals wide (successive windows overlapped by one interval); the result-
ing novelty score is shown at the top, its peaks indicating segment bound-
aries – that is, temporal points at which significant changes were detected
in the sequence.
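A compact sketch of the Gaussian checkerboard filter and the resulting novelty scores (kernel size and width below are illustrative choices, not the values used for Figure 12.5):

    import numpy as np

    def checkerboard_kernel(half=4, sigma=2.0):
        # Gaussian-tapered checkerboard (Foote & Cooper, 2003): +1 in the two
        # diagonal quadrants, -1 in the two off-diagonal quadrants
        u = np.arange(-half + 0.5, half + 0.5)
        taper = np.exp(-(u ** 2) / (2 * sigma ** 2))
        return np.outer(taper, taper) * np.outer(np.sign(u), np.sign(u))

    def novelty_scores(S, half=4):
        # slide the kernel along the diagonal of similarity matrix S;
        # peaks mark abrupt changes (segment boundaries)
        K = checkerboard_kernel(half)
        n = S.shape[0]
        scores = np.zeros(n)
        for t in range(half, n - half):
            scores[t] = (S[t - half:t + half, t - half:t + half] * K).sum()
        return scores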
The intent of these examples has been to whet your appetite. Recurrence
analysis offers many more possibilities than the few illustrated here,
including meaningful summary measures of entire patterns. Once again,
interested readers are encouraged to read further (e.g., Marwan, Romano,
Thiel, & Kurths, 2007; Riley & Van Orden, 2005; Zbilut & Webber,
2007).
exact test for 2×2 contingency tables (e.g., Hays, 1963). Given a short event
sequence like ACBACBACB (length N = 9), a permutation test for the Lag 1
BA transition would proceed in five steps as follows:
1. The observed transition frequency – xBA – is tallied; in this case, its
value is 2.
2. All possible permutations of the sequence are listed; this example
yields N! = 9! = 362,880 permuted sequences. One of them, of course,
is the sequence observed, and the simple code frequencies are the
same for all sequences.
3. For each permuted sequence the frequency of the BA transition –
xBA(s), where s = 1...N! – is tallied. Values for xBA(s) can vary between
0 (for those permuted sequences in which A never follows B; e.g.,
ACBCABCAB) and 3 (for those in which A follows B three times,
which is the maximum possible given that B’s simple frequency is
3; e.g., CBACBACBA). The number of sequences that contain 0, 1,
2, and 3 BA transitions – in this case, 86,400, 194,400, 77,760, and
4,320 or 23.8 percent, 53.6 percent, 21.4 percent, and 1.19 percent,
respectively – constitute a sampling distribution for the number of
BA transitions expected by chance.
4. The distribution median – mBA – is computed; in this case its value
is 0.99.
5. If xBA > mBA, then the one-tailed exact p value for xBA = 2 is the pro-
portion of permuted sequences in which xBA ≥ 2; if xBA < mBA, then
the p value for xBA = 2 is the proportion of permuted sequences in
which xBA ≤ 2. In this case, the exact p value for the observed value of
2 is .226 (.214 + .012).
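These steps are easy to express in Python; the sketch below enumerates all 9! permutations for this short example and also shows the sampled (shuffling) alternative discussed later in this chapter:

    import itertools, random

    def count_lag1(seq, given, target):
        # frequency of the given-target Lag 1 transition
        return sum(1 for a, b in zip(seq, seq[1:]) if a == given and b == target)

    seq = list('ACBACBACB')
    observed = count_lag1(seq, 'B', 'A')                       # 2

    # exact: enumerate all N! = 362,880 permutations (short sequences only)
    dist = [count_lag1(p, 'B', 'A') for p in itertools.permutations(seq)]
    p_exact = sum(v >= observed for v in dist) / len(dist)     # .226, as in step 5

    # sampled: shuffle instead of enumerating when N! is impractical
    random.seed(1)
    shuffles = [count_lag1(random.sample(seq, len(seq)), 'B', 'A') for _ in range(1000)]
    p_est = sum(v >= observed for v in shuffles) / 1000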
The procedure for computing a one-tailed exact p value for every possible
transition among codes A, B, and C in the ACBACBACB sequence is similar:
(1) The observed transition frequencies xrc are tallied (r,c = 1, 2, 3); (2) for
each permuted sequence, the frequency of every transition is tallied, xrc(s);
(3) One sampling distribution of N! values and one median, mrc, is obtained
for every cell (r,c) in the Lag 1 table; (4) The exact one-tailed p value for
cell (r,c) is then the proportion of values in its sampling distribution that
are equal to or greater than xrc (if xrc > mrc) or are equal to or less than xrc
(if xrc < mrc). Note that in the observed sequence, ACBACBACB, no code
repeats, whereas in many of the N! permuted sequences codes may repeat
(e.g., ACBBCBCAA). When codes may repeat in the observed sequence, the
sampling distributions are constructed using the N! permuted sequences,
even if no code happened to repeat in the observed one. However, when
              Chat      Write     Read      Ask       Attentive   Total
Chat          1         3         4         6         2           16
              .099–     .638+     .574–     .055+     .659+
Write         5         2         5         0         2           14
              .151+     .550–     .266+     .031–     .622+
Read          5         0         9         4         1           19
              .426+     .011–     .022+     .612+     .213–
Ask           5         2         2         3         3           15
              .176+     .440–     .163–     .587+     .310+
Attentive     0         6         0         2         2           10
              .069–     .002+     .034–     .656+     .357+
Total         16        13        20        15        10          74

Figure 12.6. The first number in each cell (top) is the observed count for 2-event chains (i.e., Lag 1 transitions) computed for the single-code event sequence shown at the bottom (N = 75). The second number in each cell is the p value for each 2-event chain, estimated using sampled permutations. See text for details.
chi-square for this table is G²(16) = 40.96, with asymptotic p = .0006 (for
comparison, its exact p value as computed by PSEQ was .0221). Because
the chi-square indicates sequential association in the data, we decided
to probe further. Note that the number of possible permutations for this
sequence is N! = 75!, which is approximately 2.48·10^109 (i.e., 248 followed
by 107 zeros). The sequence was shuffled 1,000 times, and a sampling dis-
tribution was obtained for each cell in the table. For each Lag 1 transition
(i.e., each cell in the table), Figure 12.6 shows – in addition to its count –
its one-tailed p value based on the sampling distribution. For example,
the observed count for the Attentive-Write transition is 6. Only 2 out of
the 1,000 shuffled sequences contained 6 or more Attentive-Write transi-
tions and so its estimated p value is .002 (see Figure 12.7). If the observed
count for this transition had been 5 instead, its estimated p value would
have been .014 (i.e., the probability of obtaining 5 or 6, which equals .012
+ .002). Other transitions with significant results were Write-Ask, Read-
Write, Read-Read, and Attentive-Read.
In Figure 12.6, probabilities above and below their medians are indi-
cated with plus and minus signs, indicating that the transition occurred
more or less often than expected by chance, respectively. In this case,
[Figure 12.7 bar chart: frequencies of 0 through 6 Attentive-Write chains across the 1,000 shuffles; the bars do not survive extraction.]
Figure 12.7. The sampling distribution for the Attentive-Write transition, based on shuffling an event sequence (N = 75) 1,000 times. See text for details.
summary
In addition to approaches discussed in earlier chapters, two additional tech-
niques for detecting pattern are recurrence analysis and permutation tests.
Recurrence Analysis and Permutation Tests 161
Recurrence analysis is primarily graphic and applies to any of the data types
we have described. An entire sequence defines both the horizontal and the
vertical axes of a recurrence plot; units are events or time units or intervals
(or windows containing them). Cells are colored – either black and white,
shades of gray, or different colors – depending on the similarity (various
definitions are possible) of each cell’s respective row and column codes. In
this way, patterns in the sequences – for example, repeated runs of the same
short sequences – are revealed by patterns in the plot. Matching may be
revealed when horizontal and vertical axes represent different sequences
(cross-recurrence – e.g., mother and infant). Moreover, meaningful sum-
mary measures of entire patterns can be derived from individual plots.
Permutation tests, the second approach, can detect patterns in rela-
tively short single-code event sequences. Such tests generate all possible
permutations of the observed sequence, create a sampling distribution for
the observed test statistic from the permuted sequences, and then deter-
mine the exact probability for the test statistic from this distribution. For
example, given a sequence of nine events – each of which is coded A, B, or
C – we can determine the exact probability of observing two B-A transitions
in the 9! permutations of the observed sequence of nine events.
The number of permutations can be very large – N!, where N is the length
of the sequence; as a result, constructing the sampling distribution can be
time consuming, even for relatively powerful computers. A solution is to
sample permutations instead of generating all possible permutations –
which results in an estimated p value instead of an exact one. Nonetheless,
satisfactory results can be obtained with as few as 1,000 samples (i.e., shuf-
fles of the observed sequence). Moreover, the procedure could be replicated
several times, which produces mean p values along with their 95 percent
confidence intervals. We recommend using sampled permutation tests
to identify significant Lag 1 transitions when event sequences are short;
because they require fewer assumptions, they engender greater confidence
than asymptotic tests.
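A minimal sketch of such replication, reusing the estimated_p function
sketched earlier in the chapter and assuming a normal-approximation
confidence interval (one reasonable choice among several):

    import statistics

    def replicated_p(seq, given, target, reps=10, n_shuffles=1000):
        """Mean estimated p across replications with different seeds,
        plus an approximate 95 percent confidence interval."""
        ps = [estimated_p(seq, given, target, n_shuffles, seed=r)[1]
              for r in range(reps)]
        mean = statistics.mean(ps)
        half = 1.96 * statistics.stdev(ps) / reps ** 0.5
        return mean, (mean - half, mean + half)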
Epilogue
Appendix A
Note. Table entries indicate the expected value of kappa when comparing two observers, both
accurate at the indicated level, using a scheme with K codes. For example, the minimum acceptable
value of kappa is .76 if you want 90% accuracy, K = 5, and code probabilities (prevalence) are
moderately variable. For details, see Bakeman et al. (1997).
Appendix B
Note. Table entries indicate the expected value of kappa when comparing an observer accurate
at the indicated level with a gold standard, using a scheme with K codes. For example, the minimum
acceptable value of kappa is .87 if you want 90% accuracy, K = 5, and code probabilities (prevalence)
are moderately variable. For details, see Bakeman et al. (1997).
References
Bakeman, R., Adamson, L. B., Konner, M., & Barr, R. (1990). !Kung infancy: The social context of object exploration. Child Development, 61, 794–809.
Bakeman, R., & Brownlee, J. R. (1980). The strategic use of parallel play: A sequential analysis. Child Development, 51, 873–878.
Bakeman, R., & Brownlee, J. R. (1982). Social rules governing object conflicts in toddlers and preschoolers. In K. H. Rubin & H. S. Ross (Eds.), Peer relationships and social skills in childhood (pp. 99–111). New York: Springer-Verlag.
Bakeman, R., Deckner, D. F., & Quera, V. (2005). Analysis of behavioral streams. In D. M. Teti (Ed.), Handbook of research methods in developmental science (pp. 394–420). Oxford: Blackwell Publishers.
Bakeman, R., & Dorval, B. (1989). The distinction between sampling independence and empirical independence in sequential analysis. Behavioral Assessment, 11, 31–37.
Bakeman, R., & Gottman, J. M. (1986). Observing interaction: An introduction to sequential analysis. Cambridge: Cambridge University Press.
Bakeman, R., & Gottman, J. M. (1997). Observing interaction: An introduction to sequential analysis (2nd ed.). Cambridge: Cambridge University Press.
Bakeman, R., & Helmreich, R. (1975). Cohesiveness and performance: Covariation and causality in an undersea environment. Journal of Experimental Social Psychology, 11, 478–489.
Bakeman, R., & Quera, V. (1992). SDIS: A sequential data interchange standard. Behavior Research Methods, Instruments, and Computers, 24, 554–559.
Bakeman, R., & Quera, V. (1995a). Analyzing interaction: Sequential analysis with SDIS and GSEQ. Cambridge: Cambridge University Press.
Bakeman, R., & Quera, V. (1995b). Log-linear approaches to lag-sequential analysis when consecutive codes may and cannot repeat. Psychological Bulletin, 118, 272–284.
Bakeman, R., & Quera, V. (2008). ActSds and OdfSds: Programs for converting INTERACT and The Observer data files into SDIS timed-event sequential data files. Behavior Research Methods, 40, 869–872.
Bakeman, R., & Quera, V. (2009). GSEQ 5 [Computer software and manual]. Retrieved from www.gsu.edu/~psyrab/gseq/gseq.html
Bakeman, R., & Quera, V. (2012). Behavioral observation. In H. Cooper (Ed.-in-Chief), P. Camic, D. Long, A. Panter, D. Rindskopf, & K. J. Sher (Assoc. Eds.), APA handbooks in psychology: Vol. 1. APA handbook of research methods in psychology: Psychological research: Foundations, planning, methods, and psychometrics. Washington, DC: American Psychological Association.
Bakeman, R., Quera, V., & Gnisci, A. (2009). Observer agreement for timed-event sequential data: A comparison of time-based and event-based algorithms. Behavior Research Methods, 41, 137–147.
Bakeman, R., Quera, V., McArthur, D., & Robinson, B. F. (1997). Detecting sequential patterns and determining their reliability with fallible observers. Psychological Methods, 2, 357–370.
Bakeman, R., & Robinson, B. F. (1994). Understanding log-linear analysis with ILOG: An interactive approach. Hillsdale, NJ: Lawrence Erlbaum Associates.
Bakeman, R., & Robinson, B. F. (2005). Understanding statistics in the behavioral sciences. Mahwah, NJ: Lawrence Erlbaum Associates.
Bakeman, R., Robinson, B. F., & Quera, V. (1996). Testing sequential association: Estimating exact p values using sampled permutations. Psychological Methods, 1, 4–15.
Barker, R. G. (1963). The stream of behavior: Explorations of its structure and content. New York: Appleton-Century-Crofts.
Barker, R. G., & Wright, H. (1951). One boy's day: A specimen record of behavior. New York: Harper.
Bass, R. F., & Aserlind, L. (1984). Interval and time-sample data collection procedures: Methodological issues. Advances in Learning and Behavioral Disabilities, 3, 1–9.
Becker, M., Buder, E., Bakeman, R., Price, M., & Ward, J. (2003). Infant response to mother call patterns in Otolemur garnettii. Folia Primatologica, 74, 301–311.
Bekoff, M. (1979). Behavioral acts: Description, classification, ethogram analysis, and measurement. In R. B. Cairns (Ed.), The analysis of social interactions: Methods, issues, and illustrations (pp. 67–80). Hillsdale, NJ: Lawrence Erlbaum Associates.
Belsky, J., & Most, R. K. (1981). From exploration to play: A cross-sectional study of infant free play behavior. Developmental Psychology, 17, 630–639.
Berk, R. A. (1979). Generalizability of behavioral observations: A clarification of interobserver agreement and interobserver reliability. American Journal of Mental Deficiency, 83, 460–472.
Bernard, C. (1927). An introduction to the study of experimental medicine. New York: Macmillan. [Introduction à l'étude de la médecine expérimentale. Paris: J.-B. Baillière, 1865].
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press.
Boice, R. (1983). Observational skills. Psychological Bulletin, 93, 3–29.
Brennan, R. L., & Kane, M. T. (1977). An index of dependability for mastery tests. Journal of Educational Measurement, 14, 277–289.
Buder, E. H., Warlaumont, A. S., Oller, D. K., & Chorna, L. B. (2010, May). Dynamic indicators of mother-infant prosodic and illocutionary coordination. Proceedings of Speech Prosody 2010, Chicago, IL.
Castellan, N. J., Jr. (1992). Shuffling arrays: Appearances may be deceiving. Behavior Research Methods, Instruments, and Computers, 24, 72–77.
Chorney, J. M., Garcia, A. M., Berlin, K., Bakeman, R., & Kain, Z. N. (2010). Time-window sequential analysis: An introduction for pediatric psychologists. Journal of Pediatric Psychology, 35, 1060–1070. doi: 10.1093/jpepsy/jsq022.
Cochran, W. G. (1954). Some methods for strengthening the common χ2 tests. Biometrics, 10, 417–451.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences (revised edition). New York: Academic Press.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cohn, J. F., & Kanade, T. (2007). Use of automated facial image analysis for measurement of emotion expression. In J. A. Coan & J. J. B. Allen (Eds.), Handbook of emotion elicitation and assessment. Oxford University Press Series in Affective Science (pp. 222–238). New York: Oxford.
Cohn, J. F., & Sayette, M. A. (2010). Spontaneous facial expression in a small group can be automatically measured: An initial demonstration. Behavior Research Methods, 42, 1079–1086.
Cooper, H. (Ed.-in-Chief), Camic, P., Long, D., Panter, A., Rindskopf, D., & Sher, K. J. (Assoc. Eds.). (2012). APA handbooks in psychology: Vol. 3. APA handbook of research methods in psychology: Data analysis and research publication. Washington, DC: American Psychological Association.
Cooper, M., & Foote, J. (2002). Automatic music summarization via similarity analysis. Proceedings of the International Symposium on Music Information Retrieval, 81–85.
Cote, L. R., Bornstein, M. H., Haynes, O. M., & Bakeman, R. (2008). Mother-infant person- and object-directed interactions in Latino immigrant families: A comparative approach. Infancy, 13, 338–365.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measures. New York: Wiley.
Dale, R., & Spivey, M. J. (2006). Unraveling the dyad: Using recurrence analysis to explore patterns of syntactic coordination between children and caregivers in conversation. Language Learning, 56, 391–430.
Deckner, D. F., Adamson, L. B., & Bakeman, R. (2003). Rhythm in mother-toddler interactions. Infancy, 4, 201–217.
Dijkstra, W., & Taris, T. (1995). Measuring the agreement between sequences. Sociological Methods and Research, 24, 214–231.
Douglass, W. (1760). A summary, historical and political, of the first planting, progressive improvements, and present state of the British settlements in North-America (Vol. 1). London: R. and J. Dodsley.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press.
Eckmann, J.-P., Kamphorst, S. O., & Ruelle, D. (1987). Recurrence plots of dynamical systems. Europhysics Letters, 5, 973–977.
Edgington, E. S., & Onghena, P. (2007). Randomization tests (4th ed.). Boca Raton, FL: Chapman and Hall/CRC.
Ekman, P. W., & Friesen, W. (1978). Facial Action Coding System: A technique for the measurement of facial movement. Palo Alto, CA: Consulting Psychologist Press.
Fagen, R. M., & Mankovich, N. J. (1980). Two-act transitions, partitioned contingency tables, and the 'significant cells' problem. Animal Behaviour, 28, 1017–1023.
Fienberg, S. E. (1980). The analysis of cross-classified categorical data. Cambridge, MA: MIT Press.
Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: Wiley.
Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass
correlation coefficient as measures of reliability. Educational and Psychological
Measurement, 33, 613–619.
Fleiss, J. L., Cohen, J., & Everitt, B.S. (1969). Large sample standard errors of kappa
and weighted kappa. Psychological Bulletin, 72, 323–327.
Foote, J., & Cooper, M. (2003). Media segmentation using self-similarity decom-
position. Proceedings of the Society of Photo-Optical Instrumentation Engineers
(SPIE), 5021, 167–175.
Fossey, D. (1972). Vocalizations of the mountain gorilla (Gorilla gorilla beringei).
Animal Behaviour, 20, 36–53.
Gardner, W. (1995). On the reliability of sequential data: Measurement, meaning,
and correction. In J. M. Gottman (Ed.), The analysis of change (pp. 339–359).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Galisson, F. (2000). Introduction to computational sequence analysis. Tutorial, ISMB
2000, 8th International Conference on Intelligent Systems for Molecular
Biology, August, San Diego, CA. Available at www.iscb.org/ismb2000/tutor-
ial_pdf/galisson4.pdf
Goodenough, F. (1928). Measuring behavior traits by means of repeated short sam-
ples. Journal of Juvenile Research, 12, 230–235.
Goodman, S. H., Thompson, S. F., Rouse, M. H., & Bakeman, R. (2010). Extending
models of sensitive parenting of infants to women at risk for perinatal depression.
Unpublished manuscript.
Gottman, J. M. (1979). Marital interaction: Experimental investigations. New York:
Academic Press.
â•… (1980). On analyzing for sequential connection and assessing interobserver reli-
ability for the sequential analysis of observational data. Behavioral Assessment,
2, 361–368.
â•… (1981). Time-series analysis: A comprehensive introduction for social scientists.
Cambridge: Cambridge University Press.
Gottman, J. M., & Roy, A. K. (1990). Sequential analysis: A guide for behavioral
research. Cambridge: Cambridge University Press.
Gros-Louis, J., West, M. J., Goldstein, M. H., & King, A. P. (2006). Mothers provide
differential feedback to infants’ prelinguistic sounds. International Journal of
Behavioral Development, 30, 509–516.
Haberman, S. J. (1978). Analysis of qualitative data (Vol. 1). New York: Academic
Press.
â•… (1979). Analysis of qualitative data (Vol. 2). New York: Academic Press.
Haccou, P., & Meelis, E. (1992). Statistical analysis of behavioural data: An approach
based on time-structured models. Oxford: Oxford University Press.
Haddock, C., Rindskopf, D., & Shadish, W. (1998). Using odds ratios as effect sizes
for meta-analysis of dichotomous data: A primer on methods and issues.
Psychological Methods, 3, 339–353.
Hall, S., & Oliver, C. (1997). A graphical method to aid the sequential analysis of
observational data. Behavior Research Methods, Instruments, and Computers,
29, 563–573.
Hartmann, D. P. (1982). Assessing the dependability of observational data. In
D. P. Hartmann (Ed.), Using observers to study behavior: New directions for
174 References
methodology of social and behavioral science (No. 14, pp. 51–65). San Francisco:
Jossey-Bass.
Hays, W. L. (1963). Statistics (1st ed.). New York: Holt, Rinehart, & Winston.
Helfman, J. I. (1996). Dotplot patterns: A literal look at pattern languages. Theory and Practice of Object Systems, 2, 31–41.
Hutt, S. J., & Hutt, C. (1970). Direct observation and measurement of behaviour. Springfield, IL: Thomas.
Kaye, K. (1980). Estimating false alarms and missed events from interobserver agreement: A rationale. Psychological Bulletin, 88, 458–468.
Kennedy, J. J. (1992). Analyzing qualitative data: Log-linear analysis for behavioral research (2nd ed.). New York: Praeger.
Konner, M. J. (1976). Maternal care, infant behavior, and development among the !Kung. In R. B. Lee & I. DeVore (Eds.), Kalahari hunter-gatherers (pp. 218–245). Cambridge, MA: Harvard University Press.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Maclure, M., & Willett, W. C. (1987). Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology, 126, 161–169.
Maizel, J. V., Jr., & Lenk, R. P. (1981). Enhanced graphic matrix analysis of nucleic acid and protein sequences. Proceedings of the National Academy of Sciences, 78, 7665–7669.
Mann, J., Ten Have, T., Plunkett, J. W., & Meisels, S. J. (1991). Time sampling: A methodological critique. Child Development, 62, 227–241.
Mannila, H., & Ronkainen, P. (1997). Similarity of event sequences. In Proceedings of the Fourth International Workshop on Temporal Representation and Reasoning, TIME'97 (pp. 136–139). Daytona Beach, FL.
Mannila, H., & Seppänen, J. (2001). Recognizing similar situations from event sequences. In Proceedings of the First SIAM Conference on Data Mining, Chicago. Available at www.cs.helsinki.fi/~mannila/postscripts/mannilaseppanensiam.pdf
Martin, P., & Bateson, P. (2007). Measuring behaviour: An introductory guide (3rd ed.). Cambridge: Cambridge University Press.
Marwan, N. (2003). Encounters with neighbours – Current developments of concepts based on recurrence plots and their applications. Ph.D. thesis, University of Potsdam. ISBN 3-00-012347-4.
Marwan, N., Romano, M. C., Thiel, M., & Kurths, J. (2007). Recurrence plots for the analysis of complex systems. Physics Reports, 438, 237–329. doi:10.1016/j.physrep.2006.11.001.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.
Mehta, C., & Patel, N. (1992). StatXact: Statistical software for exact nonparametric inference. Cambridge, MA: Cytel Software Corporation.
Messinger, D. S., Mahoor, M. H., Chow, S., & Cohn, J. F. (2009). Automated measurement of facial expression in infant–mother interaction: A pilot study. Infancy, 14, 285–305.
Miller, R. G., Jr. (1966). Simultaneous statistical inference. New York: McGraw-Hill.
White, D. P., King, A. P., & Duncan, S. D. (2002). Voice recognition technology as a tool for behavioral research. Behavior Research Methods, Instruments, and Computers, 34, 1–5.
Wickens, T. D. (1989). Multiway contingency tables analysis for the social sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Wiggins, J. S. (1973). Personality and prediction. Reading, MA: Addison-Wesley.
Wilkinson, L., & Task Force on Statistical Inference (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Wolff, P. (1966). The causes, controls, and organization of the neonate. Psychological Issues, 5 (Whole No. 17).
Yoder, P., & Symons, F. (2010). Observational measurement of behavior. New York: Springer.
Yoder, P. J., & Tapp, J. (2004). Empirical guidance for time-window sequential analysis of single cases. Journal of Behavioral Education, 13, 227–246.
Zbilut, J. P., & Webber, C. L., Jr. (2007). Recurrence quantification analysis: Introduction and historical context. International Journal of Bifurcation and Chaos, 17, 3477–3481.
Index
computer-assisted coding systems, 37, 54, 124. See Mangold INTERACT, Noldus The Observer
conditional independence in log-linear analysis, 142
conditional probability, 108
confidence intervals. See odds ratio
confusion matrix, 60
context codes, SDIS, 49
contingency indices. See log odds, odds ratio, Yule's Q
co-occurrence, 106
correlational studies, 3
criterion-referenced ICC, 87, 90
Cronbach's internal consistency alpha, 91
cross-recurrence plot, 150
data management, 54
data modification commands, 119. See AND, BOUT, CHAIN, EVENT, LUMP, NOR, NOT, OR, RECODE, REMOVE, RENAME, WINDOW, XOR
data modification, benefits of, 118, 124
data reduction, 43, 93
data transformations and recodes, 128
data types in SDIS, 44. See interval, multicode, single-code, state, timed-event
declaration, SDIS, 46
degrees of freedom
  for chi-square, 111
  in log-linear analysis, 141
deviant cells, 128
digital recording, advantages of, 37
dot plot. See recurrence plot
duration, 97
  for interval and multicode data, 97
  for single-code event data, 97
  for timed-event data, 97
  relative. See relative duration
dynamic programming, 73
Ekman, Paul, facial action coding system (FACS), 20
embedding dimension in recurrence plots, 149
empirical zeros, 76
episode for interval and multicode data, 96, 99, 100, 102
estimated duration for interval data, 97
event
  J. Altmann's definition. See Altmann
  onset and offset times, 48
EVENT command, 121
event data, 72. See multicode, single-code, state, timed-event
event recording, 26
event-based agreement, 72, 78
exact p values, 156
exclusive offset times, 46
exhaustive. See mutually exclusive and exhaustive
expected frequency, 109
experimental studies, 3
export files, 127
factors, 4
  between-subjects, 4
  in SDIS, 48
  pooling over. See pooling
  within-subjects, 5
file formats, 55
files. See export, MDS, SDS, tab-delimited
Fossey, Dian, gorilla vocalization codes, 23
frames, number per second, 38, 45
frequency, 95
  for interval or multicode data, 96
  for single-code or timed-event data, 96
  relative. See relative frequency
G2 difference test, 130, 141
gap
  mean between event onsets, 100
  mean between events, 100
  min and max, 100
Generalized Sequential Querier. See GSEQ
gold standard, 57, 68, 75
  advantages, 68, 69
  disadvantages, 68
GSEQ, 44, 118
Haccou & Meelis, alignment algorithm, 80
hierarchical rule
  when coding, 16
  when tallying, 106
hypergeometric distribution, 156
ICC, 58, 87
  and weighted kappa, 82
  formulas for, 91
  models for, 90
  reliability sample, 88
  standards for, 91
  vs. kappa, 58
  vs. r, 88
inclusive offset times, 46