CiML v5 Book PDF
CiML v5 Book PDF
Microtome Publishing
Brookline, Massachusetts
www.mtome.com
Causality in Time Series
Challenges in Machine Learning, Volume 5
Florin Popescu and Isabelle Guyon, editors
Nicola Talbot, production editor
ISBN-13: 978-0-9719777-5-4
Causality Workbench
http://clopinet.com/causality
Foreword
The topic of causality has been subject to a lengthy academic debate in Western sci-
ence and philosophy, as it forms the linchpin of systematic scientific explanations of
nature and the basis of rational economic policy. The former dates back to Aristotle’s
momentous separation of inductive and deductive reasoning – as the inductive reason-
ing has lacked tools (statistics) to support its conclusions on a formal, objective basis,
it has long taken a backseat to the rigor of logical deductive reasoning. Despite the
20th century rise to prominence of statistics, initially intended to resolve causal quan-
daries in agricultural and industrial process refinement, the field of statistical causal
inference is relatively young. Although its pioneers have received wide praise (Clive
Granger receiving the Nobel Prize and Judea Pearl receiving the ACM Turing Award)
the methods they have developed are not yet widely known and are still subject to re-
finement. Although one of the least controversial necessary criterion of establishing a
cause-effect is temporal precedence, this type of inference does not require time infor-
mation – rather, it aims to establish possible causal relations among observations on
other (logical) grounds based on conditional independence testing. The work of Clive
Granger, built upon the 20th century development of time series modeling in engi-
neering and economics, with some input from physiology, leads to a framework which
admittedly does not allow us to identify causality unequivocally.
At the time of the Neural Information Processing Systems (NIPS 2009) Mini-
Symposium on Time Series Causality (upon which this volume is based), there had
been scant interaction among the Machine Learning researchers who undertake the an-
nual pilgrimage to NIPS and the economists, engineers and neuro-physiologists who
not only require causal inference methods, but also help develop them. Following the
highly successful 2008 NIPS Causality Workshop (organized by Isabelle Guyon and
featuring, among others, Judea Pearl), it was decided to follow-up with a symposium
the following year aiming precisely to present related work by non-‘machine learners’ to this community. The symposium presented the current state of the art and helped
suggest future means of cross-disciplinary collaboration, while also featuring a tribute
to the work of the late Clive Granger by his former friend and colleague Hal White.
This work therefore presents an interdisciplinary exposition of both methodological
challenges and recent innovations.
The chapter of White and Chalak presents a detailed formal exposition of causal
inference in econometrics and, very importantly, provides the long awaited link be-
tween the time-series causality work of Clive Granger (based on relative information
of the present/past states of a time series pair) and the Pearl-type inference based on
conditional information (focused on triads or three-way dependence among variables),
as well as a practical exposition of a testing procedure that expands the classical errors
Acknowledgments
We would like to thank Guido Nolte for his support and fruitful discussions. We would
also like to thank Pascal2 EU Network of Excellence and the NIPS foundation for
supporting the NIPS Mini-symposium on Time Series Causality.
Preface
This book reprints papers of the Mini Symposium on Causality in Time Series, which
was part of the Neural Information Processing Systems 2009 (NIPS 2009), Decem-
ber 10, 2009, Vancouver, Canada. The papers were initially published on-line in JMLR
Workshop and Conference proceedings (JMLR W&CP), Volume 12: http://jmlr.
csail.mit.edu/proceedings/papers/v12/.
Florin Popescu
Fraunhofer Institute FIRST, Berlin
Florin.popescu@first.fraunhofer.de
Isabelle Guyon
Clopinet, California, USA
guyon@clopinet.com
Table of Contents
Foreword i
Preface v
Linking Granger Causality and the Pearl Causal Model with Settable Systems 107
H. White, K. Chalak & X. Lu; JMLR W&CP 12:1–29.
JMLR: Workshop and Conference Proceedings 12:115–139, 2011 Causality in Time Series
Abstract
The Causality Workbench project is an environment to test causal discovery algo-
rithms. Via a web portal (http://clopinet.com/causality), it provides a
number of resources, including a repository of datasets, models, and software pack-
ages, and a virtual laboratory allowing users to benchmark causal discovery algo-
rithms by performing virtual experiments to study artificial causal systems. We reg-
ularly organize competitions. In this paper, we describe what the platform offers for
the analysis of causality in time series analysis.
Keywords: Causality, Benchmark, Challenge, Competition, Time Series Prediction.
1. Introduction
Uncovering cause-effect relationships is central in many aspects of everyday life in both
highly industrialized and developing countries: what affects our health, the economy,
climate changes, world conflicts, and which actions have beneficial effects? Estab-
lishing causality is critical to guiding policy decisions in areas including medicine and
pharmacology, epidemiology, climatology, agriculture, economy, sociology, law en-
forcement, and manufacturing.
One important goal of causal modeling is to predict the consequences of given ac-
tions, also called interventions, manipulations or experiments. This is fundamentally
different from the classical machine learning, statistics, or data mining setting, which
focuses on making predictions from observations. Observations imply no manipula-
tion on the system under study whereas actions introduce a disruption in the natural
functioning of the system. In the medical domain, this is the distinction made between
“diagnosis” and “prognosis” (prediction from observations of diseases or disease evolu-
tion) and “treatment” (intervention). For instance, smoking and coughing might be both
predictive of respiratory disease and helpful for diagnosis purposes. However, if smok-
ing is a cause and coughing a consequence, acting on the cause (smoking) can change
your health status, but not acting on the symptom or consequence (coughing). Thus it
studies where many samples are drawn at a given point in time. Thus, sometimes the
reference to time in Bayesian networks is replaced by the notion of “causal ordering”.
Causal ordering can be understood as fixing a particular time scale and considering only
causes happening at time t and effects happening at time t + δt, where δt can be made
as small as we want. Within this framework, causal relationships may be inferred from
data including no explicit reference to time. Causal clues in the absence of temporal
information include conditional independencies between variables and loss of informa-
tion due to irreversible transformations or the corruption of signal by noise (Sun et al.,
2006; Zhang and Hyvärinen, 2009).
It seems reasonable to think that temporal information should resolve many causal
relationship ambiguities. Yet the addition of the time dimension simplifies the problem
of inferring causal relationships only to a limited extent. For one, it reduces, but does
not eliminate, the problem of confounding: A correlated event A happening in the past
of event B cannot be a consequence of B; however it is not necessarily a cause because
a previous event C might have been a “common cause” of A and B. Secondly, it opens
the door to many subtle modeling questions, including problems arising with modeling
the dynamic systems, which may or may not be stationary. One of the charters of
our Causality Workbench project is to collect both problems of practical and academic
interest to push the envelope of research in inferring causal relationships from time
series analysis.
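The confounding caveat above can be illustrated with a short simulation (a hypothetical sketch, not taken from the Workbench; all variable names and parameters are ours): a hidden process C drives A at lag 1 and B at lag 2, so A precedes and strongly predicts B without causing it, while conditioning on the common cause removes the association.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000

# Hidden common cause C; A receives it at lag 1, B at lag 2.
c = rng.normal(size=n)
e_a = 0.1 * rng.normal(size=n)
e_b = 0.1 * rng.normal(size=n)
a = np.zeros(n)
b = np.zeros(n)
a[1:] = c[:-1] + e_a[1:]   # A_t = C_{t-1} + noise
b[2:] = c[:-2] + e_b[2:]   # B_t = C_{t-2} + noise

# Align A_{t-1}, B_t and the confounder C_{t-2} for t = 2, ..., n-1.
x, y, z = a[1:-1], b[2:], c[:-2]

def regress_out(v, w):
    """Residual of v after an OLS regression on w (with intercept)."""
    W = np.column_stack([np.ones_like(w), w])
    beta, *_ = np.linalg.lstsq(W, v, rcond=None)
    return v - W @ beta

r_marginal = np.corrcoef(x, y)[0, 1]   # A precedes and predicts B: large
r_partial = np.corrcoef(regress_out(x, z), regress_out(y, z))[0, 1]  # near 0
```

Here A is correlated with the later B, yet acting on A would leave B unchanged; the vanishing partial correlation given C is exactly the kind of clue that conditional-independence methods exploit.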
3. A Virtual Laboratory
Methods for learning cause-effect relationships without experimentation (learning from
observational data) are attractive because observational data is often available in abun-
dance and experimentation may be costly, unethical, impractical, or even plain impos-
sible. Still, many causal relationships cannot be ascertained without the recourse to
experimentation and the use of a mix of observational and experimental data might be
more cost effective. We implemented a Virtual Lab allowing researchers to perform
experiments on artificial systems to infer their causal structure. The design of the platform is such that researchers can submit new artificial systems for others to experiment with, experimenters can place queries and get answers, the activity is logged, and registered
users have their own virtual lab space. This environment allows researchers to test
computational causal discovery algorithms and, in particular, to test whether the modeling assumptions they make hold in real and simulated data.
We have released a first version at http://www.causality.inf.ethz.ch/workbench.php. We plan to attach to the virtual lab sizeable realistic simulators
such as the Spatiotemporal Epidemiological Modeler (STEM), an epidemiology sim-
ulator developed at IBM, now publicly available: http://www.eclipse.org/
stem/. The virtual lab was put to work in a recent challenge we organized on the
problem of “Active Learning” (see http://clopinet.com/al). More details on
the virtual lab are given in the appendix.
Guyon Statnikov Aliferis
4. A Data Repository
Part of our benchmarking effort is dedicated to collecting problems from diverse appli-
cation domains. Via the organization of competitions, we have successfully channeled the effort of dozens of researchers to solve new problems of scientific and practical
interest and identified effective methods. However, competition without collaboration
is sterile. Recently, we have started introducing new dimensions to our effort of re-
search coordination: stimulating creativity, collaborations, and data exchange. We
are organizing regular teleconference seminars. We have created a data repository
for the Causality Workbench already populated by 15 datasets. All the resources,
which are the product of our effort, are freely available on the Internet at http:
//clopinet.com/causality. The repository already includes several time se-
ries datasets, illustrating problems of practical and academic interest (see table 1):
The donor of the dataset NOISE (Guido Nolte) received the best dataset award. The
reviewers appreciated that the task includes both real data from EEG time series and
artificial data modeling EEG. We want to encourage future data donors to move in this
direction.
Table 1: Time-dependent datasets. “TP” is the data type, “NP” the number of participants who returned results, and “V” the number of views as of January 2011. The semi-artificial datasets are obtained from simulators of real tasks. N is the number of variables, T is the number of time samples (not necessarily evenly spaced), and R the number of simulations with different initial states or conditions.

MIDS (TP: artificial; NP: NA; V: 794). Size: T=12 sampled values in time (unevenly spaced); R=10000 simulations; N=9 variables. Description: Mixed Dynamic Systems; simulated time series based on linear Gaussian models with no latent common causes, but with multiple dynamic processes. Objective: use the training data to build a model able to predict the effects of manipulations on the system in test data.

NOISE (TP: real + artificial; NP: NA; V: 783). Size: artificial: T=6000 time points, R=1000 simulations, N=2 variables; real: R=10 subjects, T≈200000 points sampled at 256 Hz, N=19 channels. Description: real and simulated EEG data; learning causal relationships using time series when noise corrupting the data causes the classical Granger causality method to fail. Objective: artificial task: find the causal direction in pairs of variables; real task: find which brain regions influence each other.

PROMO (TP: semi-artificial; NP: 3; V: 1601). Size: T=365*3 days; R=1 simulation; N=1000 promotions + 100 products. Description: simulated marketing task; daily values of 1000 promotions and 100 product sales for three years, incorporating seasonal effects. Objective: predict the 1000x100 boolean matrix of causal influences of promotions on product sales.

SEFTI (TP: semi-artificial; NP: NA; V: 908). Size: R=4000 manufacturing lots; T=300 asynchronous operations (pairs of values {one of N=25 tool IDs, date of processing}) + continuous target (circuit performance for each lot). Description: semiconductor manufacturing; each wafer undergoes 300 steps, each involving one of 25 tools; a regression problem for quality control of end-of-line circuit performance. Objective: find the tools responsible for performance degradation, their eventual interactions, and the influence of time.

SIGNET (TP: semi-artificial; NP: 2; V: 2663). Size: T=21 asynchronous state updates; R=300 pseudodynamic simulations; N=43 rules. Description: Abscisic Acid Signaling Network; model inspired by a true biological signaling network. Objective: determine the set of 43 boolean rules that describe the network.
of new algorithms. The results indicated that causal discovery from observational data
is not an impossible task, but a very hard one and pointed to the need for further re-
search and benchmarks (Guyon et al., 2008). The Causal Explorer package (Aliferis
et al., 2003), which we had made available to the participants and is downloadable as
shareware, proved to be competitive and is a good starting point for researchers new
to the field. It is a Matlab toolkit supporting “local” causal discovery algorithms, effi-
cient to discover the causal structure around a target variable, even for a large number
of variables. The algorithms are based on structure learning from tests of conditional
independence, as were all the top-ranking methods in this first challenge.
The first challenge (Guyon et al., 2008) explored an important problem in causal
modeling, but is only one of many possible problem statements. The second chal-
lenge (Guyon et al., 2010) called “competition pot-luck” aimed at enlarging the scope of
causal discovery algorithm evaluation by inviting members of the community to submit
their own problems and/or solve problems proposed by others. The challenge started
September 15, 2008 and ended November 20, 2008, see http://www.causality.
inf.ethz.ch/pot-luck.php. One task proposed by a participant drew a lot of
attention: the cause-effect pair task. The problem was to try to determine in pairs of
variables (of known causal relationships), which one was the cause of the other. This
problem is hard for a lot of algorithms, which rely on the result of conditional indepen-
dence tests of three or more variables. Yet the winners of the challenge succeeded in
unraveling 8/8 correct causal directions (Zhang and Hyvärinen, 2009).
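The flavor of such bivariate methods can be sketched with a crude additive-noise heuristic (an illustrative simplification in the spirit of, but not the actual algorithm of, Zhang and Hyvärinen (2009); the data-generating process and all parameters are invented): fit a regression in each direction and prefer the direction whose residuals look independent of the putative cause.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000
x = rng.uniform(-1.0, 1.0, size=n)
y = x ** 3 + 0.05 * rng.normal(size=n)   # ground truth: X -> Y, additive noise

def dependence_after_fit(cause, effect, deg=5):
    """Fit a polynomial regression of effect on cause and score how much the
    residuals still depend on the cause, via the correlation between squared
    residuals and squared inputs (a crude proxy for an independence test)."""
    coef = np.polyfit(cause, effect, deg)
    residuals = effect - np.polyval(coef, cause)
    return abs(np.corrcoef(residuals ** 2, cause ** 2)[0, 1])

score_xy = dependence_after_fit(x, y)   # correct direction: small residual dependence
score_yx = dependence_after_fit(y, x)   # reverse direction: residuals depend on Y
inferred = "X -> Y" if score_xy < score_yx else "Y -> X"
```

In the correct direction the residuals recover the independent noise term; in the reverse direction no additive-noise model fits, which is the asymmetry these methods exploit.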
Our planned challenge ExpDeCo (Experimental Design in Causal Discovery) will
benchmark methods of experimental design in application to causal modeling. The goal
will be to identify effective methods to unravel causal models, requiring a minimum of
experimentation, using the Virtual Lab. A budget of virtual cash will be allocated to
participants to “buy” the right to observe or manipulate certain variables, manipulations being more expensive than observations. The participants will have to spend their
budget optimally to make the best possible predictions on test data. This setup lends
itself to incorporating problems of relevance to development projects, in particular in
medicine and epidemiology where experimentation is difficult while developing new
methodology.
We are planning another challenge called CoMSICo for “Causal Models for System
Identification and Control”, which is more ambitious in nature because it will perform a
continuous evaluation of causal models rather than separating training and test phase. In
contrast with ExpDeCo in which the organizers will provide test data with prescribed
manipulations to test the ability of the participants to make predictions of the conse-
quences of actions, in CoMSICo, the participants will be in charge of making their
own plan of action (policy) to optimize an overall objective (e.g., improve the life ex-
pectancy of a population, improve the GNP, etc.) and they will be judged directly with
this objective, on an on-going basis, with no distinction between “training” and “test”
data. This challenge will also be via the Virtual Lab. The participants will be given
an initial amount of virtual cash, and, as previously, both actions and observations will
have a price. New in CoMSICo, virtual cash rewards will be given for achieving good
6. Conclusion
Our program of data exchange and benchmark proposes to challenge the research com-
munity with a wide variety of problems from many domains and focuses on realistic
settings. Causal discovery is a problem of fundamental and practical interest in many
areas of science and technology and there is a need for assisting policy making in all
these areas while reducing the costs of data collection and experimentation. Hence, the
identification of efficient techniques to solve causal problems will have a widespread
impact. By choosing applications from a variety of domains and making connections
between disciplines as varied as machine learning, causal discovery, experimental de-
sign, decision making, optimization, system identification, and control, we anticipate
that there will be a lot of cross-fertilization between different domains.
Acknowledgments
This project is an activity of the Causality Workbench supported by the Pascal network
of excellence funded by the European Commission and by the U.S. National Science
Foundation under Grant No. ECCS-0725746. Any opinions, findings, and conclusions
or recommendations expressed in this material are those of the authors and do not nec-
essarily reflect the views of the National Science Foundation. We are very grateful to
all the members of the causality workbench team for their contribution and in particu-
lar to our co-founders Constantin Aliferis, Greg Cooper, André Elisseeff, Jean-Philippe
Pellet, Peter Spirtes, and Alexander Statnikov.
References
C. F. Aliferis, I. Tsamardinos, A. Statnikov, and L.E. Brown. Causal explorer: A prob-
abilistic network learning toolkit for biomedical discovery. In 2003 International
Conference on Mathematics and Engineering Techniques in Medicine and Biologi-
cal Sciences (METMBS), Las Vegas, Nevada, USA, June 23-26 2003. CSREA Press.
C. Glymour and G.F. Cooper, editors. Computation, Causation, and Discovery. AAAI
Press/The MIT Press, Menlo Park, California, Cambridge, Massachusetts, London,
England, 1999.
volume 3, pages 1–33, WCCI2008 workshop on causality, Hong Kong, June 3-4
2008.
Jerry Jenkins. Signet: Boolean rule determination for abscisic acid signaling. In
Causality: objectives and assessment (NIPS 2008), volume 6, pages 215–224. JMLR
W&CP, 2010.
Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and
Techniques. MIT Press, 2009.
J.-P. Pellet. Detecting simple causal effects in time series. In Causality: objectives and
assessment (NIPS 2008). JMLR W&CP volume 6, supplemental material, 2010.
P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. The MIT
Press, Cambridge, Massachusetts, London, England, 2000.
X. Sun, D. Janzing, and B. Schölkopf. Causal inference by choosing graphs with most
plausible Markov kernels. In Ninth International Symposium on Artificial Intelli-
gence and Mathematics, 2006.
E. Tuv. Pot-luck challenge: Tied. In Causality: objectives and assessment (NIPS 2008).
JMLR W&CP volume 6, supplemental material, 2010.
M. Voortman, D. Dash, and M. J. Druzdzel. Learning causal models that make correct
manipulation predictions. In Causality: objectives and assessment (NIPS 2008),
volume 6, pages 257–266. JMLR W&CP, 2010.
K. Zhang and A. Hyvärinen. Distinguishing causes from effects using nonlinear acyclic
causal models. In Causality: objectives and assessment (NIPS 2008), volume 6,
pages 157–164. JMLR W&CP, 2009.
JMLR: Workshop and Conference Proceedings 12:95–114, 2011 Causality in Time Series
Abstract
This paper reviews a class of methods to perform causal inference in the framework of
a structural vector autoregressive model. We consider three different settings. In the
first setting the underlying system is linear with normal disturbances and the structural
model is identified by exploiting the information incorporated in the partial correla-
tions of the estimated residuals. Zero partial correlations are used as input of a search
algorithm formalized via graphical causal models. In the second, semi-parametric,
setting the underlying system is linear with non-Gaussian disturbances. In this case
the structural vector autoregressive model is identified through a search procedure
based on independent component analysis. Finally, we explore the possibility of
causal search in a nonparametric setting by studying the performance of conditional
independence tests based on kernel density estimations.
Keywords: Causal inference, econometric time series, SVAR, graphical causal mod-
els, independent component analysis, conditional independence tests
1. Introduction
1.1. Causal inference in econometrics
Applied economic research is pervaded by questions about causes and effects. For
example, what is the effect of a monetary policy intervention? Is energy consumption
causing growth or the other way around? Or does causality run in both directions? Are
economic fluctuations mainly caused by monetary, productivity, or demand shocks?
Does foreign aid improve living standards in poor countries? Does firms’ expenditure
in R&D causally influence their profits? Are recent rises in oil prices in part caused by
speculation? These are seemingly heterogeneous questions, but they all require some
knowledge of the causal process by which variables came to take the values we observe.
As pointed out by Florens and Mouchart (1982), testing the hypothesis of Granger
noncausality corresponds to testing conditional independence. Given lags p, {Xt } does
not Granger cause {Yt }, if
Causal Search in SVAR
Moneta Chlass Entner Hoyer
to the ‘right’ rotation of the VAR model, that is the rotation compatible both with the
contemporaneous causal structure of the variable and the structure of the innovation
term. Let us consider a matrix B0 = I − Γ0 . If the system is normalized such that the
matrix Γ0 has all the elements of the principal diagonal equal to one (which can be done
straightforwardly), the diagonal elements of B0 will be equal to zero. We can then write

$$Y_t = B_0 Y_t + \Gamma_1 Y_{t-1} + \cdots + \Gamma_p Y_{t-p} + \varepsilon_t,$$

from which we see that B0 (and thus Γ0 ) determines in which form the value of a variable Yi,t depends on the contemporaneous value of another variable Y j,t . The
‘right’ rotation will also be the one which makes εt a vector of authentic innovation
terms, which are expected to be independent (not only over time, but also contempora-
neously) sources or shocks.
In the literature, different methods have been proposed to identify the SVAR model
(4) on the basis of the estimation of the VAR model (5). Notice that there are more
unobserved parameters in (4), whose number amounts to k2 (p + 1), than parameters
that can be estimated from (5), which are k2 p + k(k + 1)/2, so one has to impose at
least k(k − 1)/2 restrictions on the system. One solution to this problem is to get a
rotation of (5) such that the covariance matrix of the SVAR residuals Σε is diagonal,
using the Cholesky factorization of the estimated residuals Σu . That is, let P be the
lower-triangular Cholesky factorization of Σu (i.e. Σu = PP' ), let D be a k × k diagonal
matrix with the same diagonal as P, and let Γ0 = DP−1 . By pre-multiplying (5) by
Γ0 , it turns out that Σε = E[Γ0 ut u't Γ'0 ] = DD' , which is diagonal. A problem with
this method is that P changes if the ordering of the variables (Y1t , . . . , Ykt )' in Yt (and, consequently, the order of the residuals in Σu ) changes. Since researchers who estimate a SVAR are often exclusively interested in tracking down the effect of a structural shock εit on the variables Y1,t , . . . , Yk,t over time (impulse response functions), Sims (1981)
suggested investigating to what extent the impulse response functions remain robust
under changes of the order of variables.
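The Cholesky construction described above is easy to check numerically (a minimal sketch; the covariance matrix Σu below is an arbitrary illustrative choice):

```python
import numpy as np

# Arbitrary illustrative residual covariance matrix (symmetric positive definite).
sigma_u = np.array([[1.0, 0.5, 0.2],
                    [0.5, 2.0, 0.3],
                    [0.2, 0.3, 1.5]])

P = np.linalg.cholesky(sigma_u)       # lower triangular, with sigma_u = P P'
D = np.diag(np.diag(P))               # diagonal matrix with the diagonal of P
gamma0 = D @ np.linalg.inv(P)         # Gamma_0 = D P^{-1}, unit diagonal

# Sigma_eps = Gamma_0 Sigma_u Gamma_0' = D D', which is diagonal.
sigma_eps = gamma0 @ sigma_u @ gamma0.T
```

Permuting the rows and columns of Σu changes P, and hence Γ0, which is precisely the ordering dependence discussed above.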
Popular alternatives to the Cholesky identification scheme are based either on the
use of a priori, theory-based, restrictions or on the use of long-run restrictions. The
former solution consists in imposing economically plausible constraints on the con-
temporaneous interactions among variables (Blanchard and Watson, 1986; Bernanke,
1986) and has the drawback of ultimately depending on the a priori reliability of eco-
nomic theory, similarly to the Cowles Commission approach. The second solution is based on the assumption that certain economic shocks have long-run effects on some variables but do not influence the long-run level of other variables (see Shapiro
and Watson, 1988; Blanchard and Quah, 1989; King et al., 1991). This approach has
been criticized as not being very reliable unless strong a priori restrictions are imposed
(see Faust and Leeper, 1997).
In the rest of the paper, we first present a method, based on the graphical causal
model framework, to identify the SVAR (section 2). This method is based on condi-
tional independence tests among the estimated residuals of the VAR estimated model.
Such tests rely on the assumption that the shocks affecting the model are Gaussian.
We then relax the Gaussianity assumption and present a method to identify the SVAR
model based on independent component analysis (section 3). Here the main assump-
tion is that shocks are non-Gaussian and independent. Finally (section 4), we explore
the possibility of extending the framework for causal inference to a nonparametric set-
ting. In section 5 we wrap up the discussion and conclude by formulating some open
questions.
second step, conditional independence relations (or d-separations, which are the graph-
ical characterization of conditional independence) are merely used to erase edges and,
in further steps, to direct edges. The output of such algorithms is not necessarily a single graph, but a class of Markov-equivalent graphs.
There is nothing in either the Markov or the faithfulness condition, nor in the constraint-based algorithms, that limits them to linear and Gaussian settings. Graphical causal models do not require per se any a priori specification of the functional
dependence between variables. However, in applications of graphical models to SVAR,
conditional independence is ascertained by testing vanishing partial correlations (Swan-
son and Granger, 1997; Bessler and Lee, 2002; Demiralp and Hoover, 2003; Moneta,
2008). Since normality of the distribution guarantees the equivalence between zero partial correlation and conditional independence, these applications deal de facto with linear and
Gaussian processes.
on the basis that αk = 0 ⇔ ρ(uit , ukt |u jt ) = 0. Since Swanson and Granger (1997) impose
the partial correlation constraints looking only at the set of partial correlations of order
one (that is conditioned on only one variable), in order to run their tests they consider
regression equations with only two regressors, as in equation (7).
Bessler and Lee (2002) and Demiralp and Hoover (2003) use Fisher’s z that is
incorporated in the software TETRAD (Scheines et al., 1998):
" #
1! |1 + ρXY.K |
z(ρXY.K , T ) = T − |K| − 3 log , (8)
2 |1 − ρXY.K |
where |K| equals the number of variables in K and T the sample size. If the variables
(for instance X = uit , Y = ukt , K = (u jt , uht )) are normally distributed, we have that
Yt = Π' Xt + ut , (10)
14
Causal Search in SVAR
where X't = [Y't−1 , . . . , Y't−p ], which has dimension (1 × kp), and Π' = [A1 , . . . , A p ], which has dimension (k × kp). In the case of a stable VAR process (see next subsection), the conditional maximum likelihood estimate of Π for a sample of size T is given by
$$\hat{\Pi}' = \left(\sum_{t=1}^{T} Y_t X_t'\right)\left(\sum_{t=1}^{T} X_t X_t'\right)^{-1},$$
which coincides with the estimated coefficient vector from an OLS regression of Yit on Xt (Hamilton 1994: 293). The maximum likelihood estimate of the matrix of variances and covariances among the error terms Σu turns out to be $\hat{\Sigma}_u = (1/T)\sum_{t=1}^{T} \hat{u}_t \hat{u}_t'$, where $\hat{u}_t = Y_t - \hat{\Pi}' X_t$. Therefore, the maximum likelihood estimate of the covariance between uit and u jt is given by the (i, j) element of Σ̂u : $\hat{\sigma}_{ij} = (1/T)\sum_{t=1}^{T} \hat{u}_{it} \hat{u}_{jt}$. Denoting by σi j
the (i, j) element of Σu , let us first define the following matrix transform operators: vec,
which stacks the columns of a k × k matrix into a vector of length k2 and vech, which
vertically stacks the elements of a k × k matrix on or below the principal diagonal into
a vector of length k(k + 1)/2. For example:
$$\mathrm{vec}\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} = \begin{pmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{12} \\ \sigma_{22} \end{pmatrix}, \qquad \mathrm{vech}\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} = \begin{pmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{22} \end{pmatrix}.$$
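The vec and vech operators can be implemented in a few lines (a sketch using NumPy; the function names are ours):

```python
import numpy as np

def vec(a):
    """Stack the columns of a square matrix into one vector (column-major)."""
    return a.flatten(order="F")

def vech(a):
    """Stack the elements on or below the principal diagonal, column by column."""
    k = a.shape[0]
    return np.concatenate([a[j:, j] for j in range(k)])

# Symmetric example matching the text, with sigma_21 = sigma_12.
s = np.array([[11.0, 12.0],
              [12.0, 22.0]])

v_full = vec(s)    # length k^2 = 4: [11, 12, 12, 22]
v_half = vech(s)   # length k(k+1)/2 = 3: [11, 12, 22]
```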
The process being stationary and the error terms Gaussian, it turns out that:
$$\sqrt{T}\,[\mathrm{vech}(\hat{\Sigma}_u) - \mathrm{vech}(\Sigma_u)] \xrightarrow{d} N(0, \Omega), \qquad (11)$$
where Ω = 2D+k (Σu ⊗ Σu )(D+k )' , D+k ≡ (D'k Dk )−1 D'k , Dk is the unique (k2 × k(k +
1)/2) matrix satisfying Dk vech(Ω) = vec(Ω), and ⊗ denotes the Kronecker product (see
Hamilton 1994: 301). For example, for k = 2, we have,
$$\sqrt{T}\begin{pmatrix} \hat{\sigma}_{11} - \sigma_{11} \\ \hat{\sigma}_{12} - \sigma_{12} \\ \hat{\sigma}_{22} - \sigma_{22} \end{pmatrix} \xrightarrow{d} N\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 2\sigma_{11}^2 & 2\sigma_{11}\sigma_{12} & 2\sigma_{12}^2 \\ 2\sigma_{11}\sigma_{12} & \sigma_{11}\sigma_{22} + \sigma_{12}^2 & 2\sigma_{12}\sigma_{22} \\ 2\sigma_{12}^2 & 2\sigma_{12}\sigma_{22} & 2\sigma_{22}^2 \end{pmatrix} \right).$$
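The general formula for Ω can be checked against this explicit k = 2 case by constructing the duplication matrix Dk numerically (an illustrative sketch; the values of σ11, σ12, σ22 are arbitrary):

```python
import numpy as np

def duplication_matrix(k):
    """The unique (k^2 x k(k+1)/2) matrix D_k with D_k vech(A) = vec(A)
    for every symmetric k x k matrix A."""
    D = np.zeros((k * k, k * (k + 1) // 2))
    col = 0
    for j in range(k):
        for i in range(j, k):
            D[j * k + i, col] = 1.0   # position of (i, j) in vec (column-major)
            D[i * k + j, col] = 1.0   # symmetric position of (j, i)
            col += 1
    return D

s11, s12, s22 = 1.0, 0.4, 2.0        # arbitrary illustrative values
sigma = np.array([[s11, s12], [s12, s22]])

D = duplication_matrix(2)
D_plus = np.linalg.inv(D.T @ D) @ D.T          # D_k^+ = (D_k' D_k)^{-1} D_k'
omega = 2 * D_plus @ np.kron(sigma, sigma) @ D_plus.T

# The explicit covariance matrix for k = 2 given in the text.
expected = np.array([
    [2 * s11**2,    2 * s11 * s12,      2 * s12**2],
    [2 * s11 * s12, s11 * s22 + s12**2, 2 * s12 * s22],
    [2 * s12**2,    2 * s12 * s22,      2 * s22**2],
])
```

The two matrices agree, confirming that the k = 2 display is just the general Ω written out element by element.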
Therefore, to test the null hypothesis that ρ(uit , u jt ) = 0 from the VAR estimated resid-
uals, it is possible to use the Wald statistic:
$$\frac{T\,\hat{\sigma}_{ij}^2}{\hat{\sigma}_{ii}\hat{\sigma}_{jj} + \hat{\sigma}_{ij}^2} \approx \chi^2(1).$$
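As an end-to-end sketch (simulated data; the VAR coefficients and all names are ours), one can estimate a small VAR by OLS, form Σ̂u from the residuals, and compute this Wald statistic, comparing it with 3.84, the 5% critical value of χ²(1):

```python
import numpy as np

rng = np.random.default_rng(2)
k, T = 2, 500

# Simulate a stable VAR(1) whose shocks are independent (so H0 is true).
A1 = np.array([[0.5, 0.1],
               [0.0, 0.4]])
u = rng.normal(size=(T + 1, k))
Y = np.zeros((T + 1, k))
for t in range(1, T + 1):
    Y[t] = A1 @ Y[t - 1] + u[t]

# OLS (= conditional ML) estimate of Pi from regressing Y_t on X_t = Y_{t-1}.
X, Yt = Y[:-1], Y[1:]
Pi_hat, *_ = np.linalg.lstsq(X, Yt, rcond=None)   # rows approximate A1'
u_hat = Yt - X @ Pi_hat
sigma_hat = (u_hat.T @ u_hat) / T                  # ML estimate of Sigma_u

# Wald statistic for H0: rho(u_1t, u_2t) = 0; compare with 3.84.
s11, s22, s12 = sigma_hat[0, 0], sigma_hat[1, 1], sigma_hat[0, 1]
wald = T * s12**2 / (s11 * s22 + s12**2)
```

A large value of the statistic would lead to rejecting the zero-correlation hypothesis and hence to keeping the corresponding edge in the search algorithm.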
The Wald statistic for testing vanishing partial correlations of any order is obtained by applying the delta method, which states that if XT is an (r × 1) sequence of vector-valued random variables with $[\sqrt{T}(X_{1T} - \theta_1), \ldots, \sqrt{T}(X_{rT} - \theta_r)] \xrightarrow{d} N(0, \Sigma)$, and h1 , . . . , hr are r real-valued functions of θ = (θ1 , . . . , θr ), hi : Rr → R, defined and continuously differentiable in a neighborhood ω of the parameter point θ and such that the matrix B = ∥∂hi /∂θ j ∥ of partial derivatives is nonsingular in ω, then:
$$[\sqrt{T}(h_1(X_T) - h_1(\theta)), \ldots, \sqrt{T}(h_r(X_T) - h_r(\theta))] \xrightarrow{d} N(0, B\Sigma B').$$
The Wald test of the null hypothesis corr(u1t , u3t |u2t ) = 0 is given by:
Yt = A1 Yt−1 + . . . + A p Yt−p + ut ,
is nonstationary and integrated of order one (∼ I(1)). This means that the VAR process Yt is not stable, i.e. det(Ik − A1 z − · · · − A p z p ) is equal to zero for some |z| ≤ 1 (Lütkepohl, 2006), and that each component ∆Yit of ∆Yt = (Yt − Yt−1 ) is stationary (I(0)), that is, it has time-invariant means, variances, and covariance structure. A linear combination of the elements of Yt is called a cointegrating relationship if there is a linear combination c1 Y1t + . . . + ck Ykt which is stationary (I(0)).
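Stability can be checked numerically through the companion matrix, whose eigenvalues lie strictly inside the unit circle exactly when det(Ik − A1 z − · · · − A p z p ) ≠ 0 for all |z| ≤ 1 (a sketch; the coefficient matrices below are illustrative):

```python
import numpy as np

def is_stable(coef_matrices):
    """Check VAR stability: all eigenvalues of the companion matrix must lie
    strictly inside the unit circle, which is equivalent to
    det(I_k - A_1 z - ... - A_p z^p) != 0 for all |z| <= 1."""
    k = coef_matrices[0].shape[0]
    p = len(coef_matrices)
    companion = np.zeros((k * p, k * p))
    companion[:k, :] = np.hstack(coef_matrices)
    if p > 1:
        companion[k:, :-k] = np.eye(k * (p - 1))
    return bool(np.max(np.abs(np.linalg.eigvals(companion))) < 1.0)

stable_var = [np.array([[0.5, 0.1],
                        [0.2, 0.4]])]      # eigenvalues 0.6 and 0.3: stable
unit_root_var = [np.array([[1.0, 0.0],
                           [0.0, 0.3]])]   # eigenvalue 1: integrated, unstable
```

A failing check signals a unit root, which is the situation motivating the error-correction treatment that follows.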
If it is the case that the VAR process is unstable with the presence of cointegrating
relationships, it is more appropriate (Lütkepohl, 2006; Johansen, 2006) to estimate the
following re-parametrization of the VAR model, called Vector Error Correction Model
(VECM):
which is equivalent to equation (11) (see that equation for the definition of the various operators). Thus, it turns out that the asymptotic distribution of the maximum likelihood estimator Σ̃u is the same as that of the OLS estimator Σ̂u in the case of a stable VAR.
Thus, the method described above for testing zero partial correlations among residuals can be applied straightforwardly to cointegrated data. The model is estimated as a Vector Error Correction Model using Johansen’s (1988, 1991) approach, correlations are tested exploiting the asymptotic distribution of Σ̃u , and the model can finally be parameterized back into its VAR form of equation (3).
17
Moneta Chlass Entner Hoyer
Step 3 Apply a causal search algorithm to recover the causal structure among u1t , . . . ,
ukt , which is equivalent to the causal structure among Y1t , . . . , Ykt (cfr. section 1.2 and
see Moneta 2003). In case of acyclic (no feedback loops) and causally sufficient (no
latent variables) structure, the suggested algorithm is the PC algorithm of Spirtes et al.
(2000, pp. 84-85). Moneta (2008) suggested few modifications to the PC algorithm in
order to make the orientation of edges compatible with as many conditional indepen-
dence tests as possible. This increases the computational time of the search algorithm,
but considering the fact that VAR models deal with a few number of time series vari-
ables (rarely more than six to eight; see Bernanke et al. 2005), this slowing down does
not create a serious concern in this context. Table 1 reports the modified PC algorithm.
In case of acyclic structure without causal sufficiency (i.e. possibly including latent vari-
ables), the suggested algorithm is FCI (Spirtes et al. 2000, pp. 144-145). In the case
of no latent variables and in the presence of feedback loops, the suggested algorithm
is CCD (Richardson and Spirtes, 1999). There is no algorithm in the literature which
is consistent for search when both latent variables and feedback loops may be present.
If the goal of the study is only impulse response analysis (i.e. tracing out the effects
of structural shocks ε1t , . . . , εkt on Yt , Yt−1 , . . .) and neither contemporaneous feedbacks
nor latent variables can be excluded a priori, a possible solution is to apply only steps
(A) and (B) of the PC algorithm. If the resulting set of possible causal structures (rep-
resented by an undirected graph) contains a manageable number of elements, one can
study the characteristics of the impulse response functions which are robust across all
the possible causal structures, where the presence of both feedbacks and latent variables
is allowed (Moneta, 2004).
Step 4 Calculate structural coefficients and impulse response functions. If the out-
put of Step 3 is a set of causal structures, run sensitivity analysis to investigate the
robustness of the conclusions under the different possible causal structures. Bootstrap
procedures may also be applied to determine which is the most reliable causal order
(see simulations and applications in Demiralp et al., 2008).
18
Causal Search in SVAR
Table 1: Search algorithm (adapted from the PC Algorithm of Spirtes et al. (2000:
84-85); in bold character the modifications). Under the assumption of Gaus-
sianity conditional independence is tested by zero partial correlation tests.
19
Moneta Chlass Entner Hoyer
20
Causal Search in SVAR
identify the matrix Γ0 , we also obtain the matrix B0 for the contemporaneous effects.
As pointed out above, the matrix Γ−1 0 (and hence Γ0 ) can be estimated using ICA up to
ordering, scaling, and sign. With the restriction of B0 representing an acyclic system,
we can resolve these ambiguities and are able to fully identify the model. For sim-
plicity, let us assume that the variables are arranged according to a causal ordering, so
that the matrix B0 is strictly lower triangular. From equation (16) then follows that the
matrix Γ0 is lower triangular with all ones on the diagonal. Using this information, the
ambiguities of ICA can be resolved in the following way.
The lower triangularity of B0 allows us to find the unique permutation of the rows of
Γ0 , which yields all non-zero elements on the diagonal of Γ0 , meaning that we replace
the matrix Γ0 with Q1 Γ0 where Q1 is the uniquely determined permutation matrix.
Finding this permutation resolves the ordering-ambiguity of ICA and links the shocks
εt to the components of the residuals ut in a one-to-one manner. The sign- and scaling-
ambiguity is now easy to fix by simply dividing each row of Γ0 (the row-permuted
version from above) by the corresponding diagonal element yielding all ones on the
diagonal, as implied by Equation (16). This ensures that the connection strength of the
shock εt on the residual ut is fixed to one in our model (Equation (15)).
For the general case where B0 is not arranged in the causal order, the above ar-
guments for solving the ambiguities still apply. Furthermore, we can find the causal
order of the contemporaneous variables by performing simultaneous row- and column-
permutations on Γ0 yielding the matrix closest to lower triangular, in particular Γ̃0 =
Q2 Γ0 Q'2 with an appropriate permutation matrix Q2 . In case non of these permutations
leads to a close to lower triangular matrix a warning is issued.
Essentially, the assumption of acyclicity allows us to uniquely connect the struc-
tural shocks εt to the components of ut and fully identify the contemporaneous struc-
ture. Details of the procedure can be found in Shimizu et al. (2006); Hyvärinen et al.
(2010). In the sense of the Cholesky factorization of the covariance matrix explained
in Section 1 (with PD−1 = Γ−1 0 ), full identifiability means that a causal order among the
contemporaneous variables can be determined.
In addition to yielding full identification, an additional benefit of using the ICA-
based procedure when shocks are non-Gaussian is that it does not rely on the faithful-
ness assumption, which was necessary in the Gaussian case.
We note that there are many ways of exploiting non-Gaussian shocks for model
identification as alternatives to directly using ICA. One such approach was introduced
by Shimizu et al. (2009). Their method relies on iteratively finding an exogenous vari-
able and regressing out their influence on the remaining variables. An exogenous vari-
able is characterized by being independent of the residuals when regressing any other
variable in the model on it. Starting from the model in equation (15), this procedure
returns a causal ordering of the variables ut and then the matrix B0 can be estimated
using the Cholesky approach.
One relatively strong assumption of the above methods is the acyclicity of the con-
temporaneous structure. In Lacerda et al. (2008) an extension was proposed where
feedback loops were allowed. In terms of the matrix B0 this means that it is not re-
21
Moneta Chlass Entner Hoyer
stricted to being lower triangular (in an appropriate ordering of the variables). While in
general this model is not identifiable because we cannot uniquely match the shocks to
the residuals, Lacerda et al. (2008) showed that the model is identifiable when assuming
stability of the generating model in (15) (the absolute value of the biggest eigenvalue in
B0 is smaller than one) and disjoint cycles.
Another restriction of the above model is that all relevant variables must be in-
cluded in the model (causal sufficiency). Hoyer et al. (2008b) extended the above
model by allowing for hidden variables. This leads to an overcomplete basis ICA
model, meaning that there are more independent non-Gaussian sources than observed
mixtures. While there exist methods for estimating overcomplete basis ICA models,
those methods which achieve the required accuracy do not scale well. Additionally,
the solution is again only unique up to ordering, scaling, and sign, and when including
hidden variables the ordering-ambiguity cannot be resolved and in some cases leads to
several observationally equivalent models, just as in the cyclic case above.
We note that it is also possible to combine the approach of section 2 with that de-
scribed here. That is, if some of the shocks are Gaussian or close to Gaussian, it may
be advantageous to use a combination of constraint-based search and non-Gaussianity-
based search. Such an approach was proposed in Hoyer et al. (2008a). In particular,
the proposed method does not make any assumptions on the distributions of the VAR-
residuals ut . Basically, the PC algorithm (see Section 2) is run first, followed by uti-
lization of whatever non-Gaussianity there is to further direct edges. Note that there is
no need to know in advance which shocks are non-Gaussian since finding such shocks
is part of the algorithm.
Finally, we need to point out that while the basic ICA-based approach does not
require the faithfulness assumption, the extensions discussed at the end of this section
do.
4. Nonparametric setting
4.1. Theory
Linear systems dominate VAR, SVAR, and more generally, multivariate time series
models in econometrics. However, it is not always the case that we know how a variable
X may cause another variable Y. It may be the case that we have little or no a priori
knowledge about the way how Y depends on X. In its most general form we want
to know whether X is independent of Y conditional on the set of potential graphical
parents Z, i.e.
H0 : Y ⊥ ⊥ X | Z, (17)
where Y, X, Z is a set of time series variables. Thereby, we do not per se require an
a priori specification of how Y possibly depends on X. However, constraint based
algorithms typically specify conditional independence in a very restrictive way. In con-
tinuous settings, they simply test for nonzero partial correlations, or in other words, for
linear (in)dependencies. Hence, these algorithms will fail whenever the data generation
process (DGP) includes nonlinear causal relations.
22
Causal Search in SVAR
f (Y, X, Z) f (YZ)
H0 : = . (18)
f (XZ) f (Z)
If we define h1 (·) := f (Y, X, Z) f (Z), and h2 (·) := f (YZ) f (XZ), we have:
where Xi , Yi , and Zi are the ith realization of the respective time series, K denotes the
kernel function, b indicates a scalar bandwidth parameter, and K p represents a product
kernel2 .
So far, we have shown how we can estimate h1 and h2 . To see whether these are
different, we require some similarity measure between both conditional densities. There
are different ways to measure the distance between a product of densities:
where a(·) is a nonnegative weighting function. Both the weighting function a(·),
and the resulting test statistic are specified in Su and White (2008).
(ii) The Euclidean distance proposed by Szekely and Rizzo (2004) in their ‘energy
test’:
1 '' 1 '' 1 ''
n n n n n n
dE = ||h1i − h2 j || − ||h1i − h1 j || − ||h2 − h2 j ||, (22)
n i=1 j=1 2n i=1 j=1 2n i=1 j=1 i
@d
2. I.e. K p ((Zi − z)/b) = j=1 K((Z ji − z j )/b). For our simulations (see next section) we choose the
kernel: K(u) = 2
(3 − u )φ(u)/2, with φ(u) the standard normal probability density function. We use a
“rule-of-thumb” bandwidth: b = n−1/8.5 .
23
Moneta Chlass Entner Hoyer
Given these test statistics and their distributions, we compute the type-I error, or p-value
of our test problem (19). If Z = ∅, the tests are available in R-packages energy and
cramer. The Hellinger distance is not suitable here, since one can only test for Z ! ∅.
For Z ! ∅, our test problem (19) requires higher dimensional kernel density esti-
mation. The more dimensions, i.e. the more elements in Z, the scarcer the data, and
the greater the distance between two subsequent data points. This so-called Curse of
dimensionality strongly reduces the accuracy of a nonparametric estimation (Yatchew,
1998). To circumvent this problem, we calculate the type-I errors for Z ! ∅ by a local
bootstrap procedure, as described in Su and White (2008, pp. 840-841) and Paparoditis
and Politis (2000, pp. 144-145). Local bootstrap draws repeatedly with replacement
from the sample and counts how many times the bootstrap statistic is larger than the
test statistic of the entire sample. Details on the local bootstrap procedure ca be found
in appendix A.
Now, let us see how this procedure fares in those time series settings, where other
testing procedures failed - the case of nonlinear time series.
Therein, V1,t ⊥
⊥ V2,t |V1,t−1 , since V1,t−1 d-separates V1,t from V2,t , while V2,t ⊥
⊥ V3,s ,
for any t and s. Hence, the set of variables Z, conditional on which two sets of vari-
ables X and Y are independent of each other, contains zero elements, i.e. V2,t ⊥ ⊥ V3,t−1 ,
contains one element, i.e. V1,t ⊥ ⊥ V2,t |V1,t−1 , or contains two elements, i.e. V1,t ⊥⊥
3. An alternative Euclidean distance is proposed by Baringhaus and Franz (2004) in their Cramer test.
This distance turns out to be dE /2. The only substantial difference from the distance proposed in (ii)
lies in the method to obtain the critical values (see Baringhaus and Franz 2004).
24
Causal Search in SVAR
Take the first line of Table 2. For size DGPs, H0 holds everywhere. A test performs
accurately if it rejects H0 in accordance with the respective theoretical significance
level. We see that the energy test rejects H0 slightly more often than it should (0.065 >
0.05; 0.122 > 0.1), whereas the Cramer test does not reject H0 often enough (0.000 <
0.05, 0.000 < 0.1). In comparison to the standard parametric Fisher’s z, we see that the
4. An example of collider is displayed in Figure 1: V2,t forms a collider between V1,t−1 and V2,t−1 .
25
Moneta Chlass Entner Hoyer
latter rejects H0 much more often than it should. The energy test keeps the type-I error
most accurately. Contrary to both nonparametric tests, the parametric procedure leads
us to suspect a lot more causal relationships than there actually are, if #Z = 0.
How well do these tests perform if H0 does not hold anywhere? That is, how
accurately do they reject H0 if it is false (power-DGPs)? For linear time series, we see
that the nonparametric energy test has nearly as much power as Fisher’s z. For nonlinear
time series, the energy test clearly outperforms Fisher’s z5 . As it did for size, Cramer’s
test generally underperforms in terms of power. Interestingly, its power appears to be
higher for higher degrees of nonlinearity. In summary, if one wishes to test for marginal
independence without any information on the type of a potential dependence, one would
opt for the energy test. It has a size close to the theoretical significance level, and has
power similar to a parametric specification.
Let us turn to #Z = 1, where H0 := X ⊥ ⊥ Y|Z, for which the results are shown in
Table 3. Starting with size DGPs, tests based on Hellinger and Euclidian distance
slightly underreject H0 whereas for the highest polynomial degree, the Hellinger test
strongly overrejects H0 . The parametric Fisher’s z slightly overrejects H0 in case of
linearity, and for higher degrees, starts to underreject H0 .
Turning to power DGPs, Fisher’s z suffers a dramatic loss in power for those poly-
nomial degrees which depart most from linearity, i.e. quadratic, and quartic relations.
Nonparametric tests which do not require linearity have high power in absolute terms,
and nearly twice as much as compared to Fisher’s z. The power properties of the non-
parametric procedures indicate that our local bootstrap succeeds in mitigating the Curse
of dimensionality. In sum, nonparametric tests exhibit good power properties for #Z = 1
whereas Fisher’s z would fail to discover underlying quadratic or quartic relationships
in some 60%, and 40% of the cases, respectively.
5. For cubic time series, Fisher’s z performs as well as the energy test does. This may be due to the fact
that a cubic relation resembles more to a line than other polynomial specifications do.
26
Causal Search in SVAR
The results for #Z = 2 are presented in Table 4. We find that both nonparametric
tests have a size which is notably smaller than the theoretical significance level we in-
duce. Hence, both have a strong tendency to underreject H0 . Turning to power DGPs,
we find that the Euclidean test still has over 90% power to correctly reject H0 . For
those polynomial degrees which depart most from linearity, i.e. quadratic and quartic,
the Euclidean test has three times as much power as Fisher’s z. However, the Hellinger
test performs even worse than Fisher’s z. Here, it may be the Curse of dimensionality
which starts to show an impact.
To sum up, we can say that both marginal independencies, and higher dimensional
conditional independencies, i.e. (#Z = 1, 2) are best tested for using Euclidean tests. The
Hellinger test seems to be more affected by the Curse of dimensionality. We see that our
local bootstrap procedure mitigates the latter, but we admit that the number of variables
our nonparametric procedure can deal with is very small. Here, it might be promising to
opt for semiparametric (Chu and Glymour, 2008), rather than nonparametric procedures
which combine parametric and nonparametric approaches.
5. Conclusions
The difficulty of learning causal relations from passive, that is non-experimental, obser-
vations is one of the central challenges of econometrics. Traditional solutions involve
the distinction between structural and reduced form model. The former is meant to
formalize the unobserved data generating process, whereas the latter aims to describe a
simpler transformation of that process. The structural model is articulated hinging on
a priori economic theory. The reduced form model is formalized in such a way that it
can be estimated directly from the data. In this paper, we have presented an approach to
identify the structural model which minimizes the role of a priori economic theory and
emphasizes the need of an appropriate and rich statistical model of the data. Graphical
27
Moneta Chlass Entner Hoyer
28
Causal Search in SVAR
6. Appendix
6.1. Appendix 1 - Details of the bootstrap procedure from 4.1.
(1) Draw a bootstrap sampling Zt∗ (for t = 1, . . . , n) from the estimated kernel density
+
fˆ(z) = n−1 b−d nt=1 K p ((Zt − z)/b).
(2) For t = 1, . . . , n, given Zt∗ , draw Xt∗ and Yt∗ independently from the estimated kernel
density fˆ(x|Zt∗ ) and fˆ(y|Zt∗ ) respectively.
(3) Using Xt∗ , Yt∗ , and Zt∗ , compute the bootstrap statistic S n∗ using one of the dis-
tances defined above.
∗ }I .
(4) Repeat steps (1) and (2) I times to obtain I statistics {S ni i=1
References
E. Baek and W. Brock. A general test for nonlinear Granger causality: Bivariate model.
Discussin paper, Iowa State University and University of Wisconsin, Madison, 1992.
29
Moneta Chlass Entner Hoyer
B.S. Bernanke, J. Boivin, and P. Eliasz. Measuring the Effects of Monetary Policy: A
Factor-Augmented Vector Autoregressive (FAVAR) Approach. Quarterly Journal of
Economics, 120(1):387–422, 2005.
D. A. Bessler and S. Lee. Money and prices: US data 1869-1914 (a study with directed
graphs). Empirical Economics, 27:427–446, 2002.
O. J. Blanchard and D. Quah. The dynamic effects of aggregate demand and supply
disturbances. The American Economic Review, 79(4):655–673, 1989.
O. J. Blanchard and M. W. Watson. Are business cycles all alike? The American
business cycle: Continuity and change, 25:123–182, 1986.
T. Chu and C. Glymour. Search for additive nonlinear time series causal models. The
Journal of Machine Learning Research, 9:967–991, 2008.
S. Demiralp and K. D. Hoover. Searching for the causal structure of a vector autore-
gression. Oxford Bulletin of Economics and Statistics, 65:745–767, 2003.
M. Eichler. Granger causality and path diagrams for multivariate time series. Journal
of Econometrics, 137(2):334–353, 2007.
30
Causal Search in SVAR
C. Hiemstra and J. D. Jones. Testing for linear and nonlinear Granger causality in the
stock price-volume relation. Journal of Finance, 49(5):1639–1664, 1994.
K.D. Hoover, S. Demiralp, and S.J. Perez. Empirical Identification of the Vector Au-
toregression: The Causes and Effects of US M2. In The Methodology and Practice
of Econometrics. A Festschrift in Honour of David F. Hendry, pages 37–58. Oxford
University Press, 2009.
31
Moneta Chlass Entner Hoyer
A. Moneta. Graphical causal models and VARs: an empirical assessment of the real
business cycles hypothesis. Empirical Economics, 35(2):275–300, 2008.
A. Moneta, D. Entner, P.O. Hoyer, and A. Coad. Causal inference by independent com-
ponent analysis with applications to micro-and macroeconomic data. Jena Economic
Research Papers, 2010:031, 2010.
E. Paparoditis and D. N. Politis. The local bootstrap for kernel estimators under general
dependence conditions. Annals of the Institute of Statistical Mathematics, 52(1):139–
159, 2000.
32
Causal Search in SVAR
C. A. Sims. An autoregressive index model for the u.s. 1948-1975. In J. Kmenta and
J.B. Ramsey, editors, Large-scale macro-econometric models: theory and practice,
pages 283–327. North-Holland, 1981.
P. Spirtes, C. Glymour, and R. Scheines. Causation, prediction, and search. MIT Press,
Cambridge MA, 2nd edition, 2000.
H. White and X. Lu. Granger Causality and Dynamic Structural Systems. Journal of
Financial Econometrics, 8(2):193, 2010.
33
Moneta Chlass Entner Hoyer
34
JMLR: Workshop and Conference Proceedings 12:65–94, 2011 Causality in Time Series
Abstract
This review focuses on dynamic causal analysis of functional magnetic resonance
(fMRI) data to infer brain connectivity from a time series analysis and dynamical
systems perspective. Causal influence is expressed in the Wiener-Akaike-Granger-
Schweder (WAGS) tradition and dynamical systems are treated in a state space mod-
eling framework. The nature of the fMRI signal is reviewed with emphasis on the
involved neuronal, physiological and physical processes and their modeling as dy-
namical systems. In this context, two streams of development in modeling causal
brain connectivity using fMRI are discussed: time series approaches to causality in a
discrete time tradition and dynamic systems and control theory approaches in a con-
tinuous time tradition. This review closes with discussion of ongoing work and future
perspectives on the integration of the two approaches.
Keywords: fMRI, hemodynamics, state space model, Granger causality, WAGS in-
fluence
1. Introduction
Understanding how interactions between brain structures support the performance of
specific cognitive tasks or perceptual and motor processes is a prominent goal in cog-
nitive neuroscience. Neuroimaging methods, such as Electroencephalography (EEG),
Magnetoencephalography (MEG) and functional Magnetic Resonance Imaging (fMRI)
are employed more and more to address questions of functional connectivity, inter-
region coupling and networked computation that go beyond the ‘where’ and ‘when’ of
task-related activity (Friston, 2002; Horwitz et al., 2000; McIntosh, 2004; Salmelin and
Kujala, 2006; Valdes-Sosa et al., 2005a). A network perspective onto the parallel and
distributed processing in the brain - even on the large scale accessible by neuroimaging
methods - is a promising approach to enlarge our understanding of perceptual, cognitive
and motor functions. Functional Magnetic Resonance Imaging (fMRI) in particular is
increasingly used not only to localize structures involved in cognitive and perceptual
processes but also to study the connectivity in large-scale brain networks that support
these functions.
Generally a distinction is made between three types of brain connectivity. Anatom-
ical connectivity refers to the physical presence of an axonal projection from one brain
area to another. Identification of large axon bundles connecting remote regions in the
brain has recently become possible non-invasively in vivo by diffusion weighted Mag-
netic resonance imaging (DWMRI) and fiber tractography analysis (Johansen-Berg and
Behrens, 2009; Jones, 2010). Functional connectivity refers to the correlation structure
(or more generally: any order of statistical dependency) in the data such that brain ar-
eas can be grouped into interacting networks. Finally, effective connectivity modeling
moves beyond statistical dependency to measures of directed influence and causality
within the networks constrained by further assumptions (Friston, 1994).
Recently, effective connectivity techniques that make use of the temporal dynamics
in the fMRI signal and employ time series analysis and systems identification theory
have become popular. Within this class of techniques two separate developments have
been most used: Granger causality analysis (GCA; Goebel et al., 2003; Roebroeck
et al., 2005; Valdes-Sosa, 2004) and Dynamic Causal Modeling (DCM; Friston et al.,
2003). Despite the common goal, there seem to be differences between the two meth-
ods. Whereas GCA explicitly models temporal precedence and uses the concept of
Granger causality (or G-causality) mostly formulated in a discrete time-series analy-
sis framework, DCM employs a biophysically motivated generative model formulated
in a continuous time dynamic system framework. In this chapter we will give a gen-
eral causal time-series analysis perspective onto both developments from what we have
called the Wiener-Akaike-Granger-Schweder (WAGS) influence formalism (Valdes-
Sosa et al., in press).
Effective connectivity modeling of neuroimaging data entails the estimation of mul-
tivariate mathematical models that benefits from a state space formulation, as we will
discuss below. Statistical inference on estimated parameters that quantify the directed
influence between brain structures, either individually or in groups (model comparison)
then provides information on directed connectivity. In such models, brain structures are
defined from at least two viewpoints. From a structural viewpoint they correspond to a
set of “nodes" that comprise a graph, the purpose of causal discovery being the identi-
fication of active links in the graph. The structural model contains i) a selection of the
structures in the brain that are assumed to be of importance in the cognitive process or
task under investigation, ii) the possible interactions between those structures and iii)
the possible effects of exogenous inputs onto the network. The exogenous inputs may
be under control of the experimenter and often have the form of a simple indicator func-
tion that can represent, for instance, the presence or absence of a visual stimulus in the
subject’s view. From a dynamical viewpoint brain structures are represented by states
or variables that describe time varying neural activity within a time-series model of the
measured fMRI time-series data. The functional form of the model equations can em-
36
Causal analysis of fMRI
37
Roebroeck Seth Valdes-Sosa
Figure 1: The neuronal, physiological and physical processes (top row) and variables
and parameters involved (middle row) in the complex causal chain of events
that leads to the formation of the fMRI signal. The bottom row lists some
mathematical models of the sub-processes that play a role in the analysis and
modeling of fMRI signals. See main text for further explanation.
38
Causal analysis of fMRI
the action potential along the axon, and release of neurotransmitter substances into the
synaptic cleft at arrival of an action potential at the synaptic terminal. There are many
different types of neurons in the mammalian brain that express these processes in differ-
ent degrees and ways. In addition, there are other cells, such as glia cells, that perform
important processes, some of them possibly directly relevant to computation or signal-
ing. As explained below, the fMRI signal is sensitive to the local oxidative metabolism
in the brain. This means that, indirectly, it mainly reflects the most energy consuming
of the neuronal processes. In primates, post-synaptic processes account for the great
majority (about 75%) of the metabolic costs of neuronal signaling events (Attwell and
Iadecola, 2002). Indeed, the greater sensitivity of fMRI to post-synaptic activity, rather
than axon generation and propagation (‘spiking’), has been experimentally verified. For
instance, in a simultaneous invasive electrophysiology and fMRI measurement in the
primate, Logothetis and colleagues (Logothetis et al., 2001) found the fMRI signal to
be more correlated to the mean Local Field Potential (LFP) of the electrophysiologi-
cal signal, known to reflect post-synaptic graded potentials, than to high-frequency and
multi-unit activity, known to reflect spiking. In another study it was shown that, by
suppressing action potentials while keeping LFP responses intact by injecting a sero-
tonin agonist, the fMRI response remained intact, again suggesting that LFP is a better
predictor for fMRI activity (Rauch et al., 2008). These results confirmed earlier results
obtained on the cerebellum of rats (Thomsen et al., 2004).
Neuronal activity, dynamics and computation can be modeled at a different levels
of abstraction, including the macroscopic (whole brain areas), mesoscopic (sub-areas
to cortical columns) and microscopic level (individual neurons or groups of these).
The levels most relevant to modeling fMRI signals are at the macro- and mesoscopic
levels. Macroscopic models used to represent considerable expanses of gray matter tis-
sue or sub-cortical structures as Regions Of Interest (ROIs) prominently include single
variable deterministic (Friston et al., 2003) or stochastic (autoregressive; Penny et al.,
2005; Roebroeck et al., 2005; Valdes-Sosa et al., 2005b) exponential activity decay
models. Although the simplicity of such models entail a large degree of abstraction
in representing neuronal activity dynamics, their modest complexity is generally well
matched to the limited temporal resolution available in fMRI. Nonetheless, more com-
plex multi-state neuronal dynamics models have been investigated in the context of
fMRI signal generation. These include the 2 state variable Wilson-Cowan model (Mar-
reiros et al., 2008), with one excitatory and one inhibitory sub-population per ROI and
the 3 state variable Jansen-Rit model with a pyramidal excitatory output population
and an inhibitory and excitatory interneuron population, particularly in the modeling of
simultaneously acquired fMRI and EEG (Valdes-Sosa et al., 2009).
The physiology and physics of the fMRI signal is most easily explained by start-
ing with the physics. We will give a brief overview here and refer to more dedicated
overviews (Haacke et al., 1999; Uludag et al., 2005) for extended treatment. The hall-
mark of Magnetic Resonance (MR) spectroscopy and imaging is the use of the reso-
nance frequency of magnetized nuclei possessing a magnetic moment, mostly protons
(hydrogen nuclei, 1H), called ‘spins’. Radiofrequency antennas (RF coils) can measure
39
Roebroeck Seth Valdes-Sosa
signal from ensembles of spins that resonate in phase at the moment of measurement.
The first important physical factor in MR is the main magnetic field strength (B0 ),
which determines both the resonance frequency (directly proportional to field-strength)
and the baseline signal-to-noise ratio of the signal, since higher fields make a larger pro-
portion of spins in the tissue available for measurement. The most used field strengths
for fMRI research in humans range from 1,5T (Tesla) to 7T. The second important
physical factor – containing several crucial parameters – is the MR pulse-sequence that
determines the magnetization preparation of the sample and the way the signal is sub-
sequently acquired. The pulse sequence is essentially a series of radiofrequency pulses,
linear magnetic gradient pulses and signal acquisition (readout) events (Bernstein et al.,
2004; Haacke et al., 1999). An important variable in a BOLD fMRI pulse sequence is
whether it is a gradient-echo (GRE) sequence or a spin-echo (SE) sequence, which
determines the granularity of the vascular processes that are reflected in the signal, as
explained later this section. These effects are further modulated by the echo-time (time
to echo; TE) and repetition time (time to repeat; TR) that are usually set by the end-user
of the pulse sequence. Finally, an important variable within the pulse sequence is the
type of spatial encoding that is employed. Spatial encoding can primarily be achieved
with gradient pulses and it embodies the essence of ‘Imaging’ in MRI. It is only with
spatial encoding that signal can be localized to certain ‘voxels’ (volume elements) in the
tissue. A strength of fMRI as a neuroimaging technique is that an adjustable trade-off
is available to the user between spatial resolution, spatial coverage, temporal resolution
and signal-to-noise ratio (SNR) of the acquired data. For instance, although fMRI can
achieve excellent spatial resolution at good SNR and reasonable temporal resolution,
one can choose to sacrifice some spatial resolution to gain a better temporal resolution
for any given study. Note, however, that this concerns the resolution and SNR of the
data acquisition. As explained below, the physiology of fMRI can put fundamental lim-
itations on the nominal resolution and SNR that is achieved in relation to the neuronal
processes of interest.
On the physiological level, the main variables that mediate the BOLD contrast in
fMRI are cerebral blood flow (CBF), cerebral blood volume (CBV) and the cerebral
metabolic rate of oxygen (CMRO2) which all change the oxygen saturation of the blood
(as usefully quantified by the concentration of deoxygenated hemoglobin). The BOLD
contrast is made possible by the fact that oxygenation of the blood changes its mag-
netic susceptibility, which has an effect on the MR signal as measured in GRE and SE
sequences. More precisely, oxygenated and deoxygenated hemoglobin (oxy-Hb and
deoxy-Hb) have different magnetic properties, the former being diamagnetic and the
latter paramagnetic. As a consequence, deoxygenated blood creates local microscopic
magnetic field gradients, such that local spin ensembles dephase, which is reflected in a
lower MR signal. Conversely, oxygenation of blood above baseline lowers the concen-
tration of deoxy-Hb, which decreases local spin dephasing and results in a higher MR
signal. This means that fMRI is directly sensitive to the relative amount of oxy- and de-
oxy Hb and to the fraction of cerebral tissue that is occupied by blood (the CBV), which
are controlled by local neurovascular coupling processes. Neurovascular processes, in
Causal analysis of fMRI
turn, are tightly coupled to neurometabolic processes controlling the rate of oxidative
glucose metabolism (the CMRO2) that is needed to fuel neural activity.
Naively one might expect local neuronal activity to quickly increase CMRO2 and
increase the local concentration of deoxy-Hb, leading to a lowering of the MR signal.
However, this transient increase in deoxy-Hb or the initial dip in the fMRI signal is not
consistently observed and, thus, there is debate over whether this signal is robust, elusive or
simply non-existent (Buxton, 2001; Ugurbil et al., 2003; Uludag, 2010). Instead, early
experiments showed that the dynamics of blood flow and blood volume, the hemody-
namics, lead to a robust BOLD signal increase. Neuronal activity is quickly followed
by a large CBF increase that serves the continued functioning of neurons by clearing
metabolic by-products (such as CO2) and supplying glucose and oxy-Hb. This CBF
response is an overcompensating response, supplying much more oxy-Hb to the local
blood system than has been metabolized. As a consequence, within 1-2 seconds, the
oxygenation of the blood increases and the MR signal increases. The increased flow
also induces a ‘ballooning’ of the blood vessels, increasing CBV, the proportion of
volume taken up by blood, further increasing the signal.
In the simplified balloon model, the BOLD signal change and the dynamics of the normalized blood volume v and deoxy-hemoglobin content q are given by:

\frac{\Delta S}{S} = V_0 \left[ k_1 (1 - q) + k_2 \left( 1 - \frac{q}{v} \right) + k_3 (1 - v) \right] \quad (1)

\dot{v}_t = \frac{1}{\tau_0} \left( f_t - v_t^{1/\alpha} \right) \quad (2)

\dot{q}_t = \frac{1}{\tau_0} \left( f_t \, \frac{1 - (1 - E_0)^{1/f_t}}{E_0} - \frac{q_t}{v_t^{1 - 1/\alpha}} \right) \quad (3)
The term E0 is the resting oxygen extraction fraction, V0 is the resting blood volume
fraction, τ0 is the mean transit time of the venous compartment, α is the stiffness
component of the balloon model, and {k1 , k2 , k3 } are calibration parameters. The main sim-
plifications of this model with respect to a more complete balloon model (Buxton et al.,
2004) are a one-to-one coupling of flow and volume in (2), thus neglecting the actual
balloon effect, and a perfect coupling between flow and metabolism in (3). Friston
et al. (2000) augment this model with a putative relation between a neuronal ac-
tivity variable z, a flow-inducing signal s, and the normalized cerebral blood flow f .
They propose the following relations in which neuronal activity z causes an increase in
a vasodilatory signal that is subject to autoregulatory feedback:
\dot{s}_t = z_t - \frac{1}{\tau_s} s_t - \frac{1}{\tau_f^2} (f_t - 1) \quad (4)

\dot{f}_t = s_t \quad (5)
Here τ_s is the signal decay time constant, τ_f is the time-constant of the feedback au-
toregulatory mechanism¹, and f is the flow normalized to baseline flow. The physio-
logical interpretation of the autoregulatory mechanism is unspecified, leaving us with
a neuronal activity variable z that is measured in units of s−2 . The physiology of the
hemodynamics contained in differential equations (2) to (5), on the other hand, is more
readily interpretable, and when integrated for a brief neuronal input pulse shows the
behavior as described above (Figure 2A, upper panel). This simulation highlights a
few crucial features. First, the hemodynamic response to a brief neural activity event
is sluggish and delayed, entailing that the fMRI BOLD signal is a delayed and low-
pass filtered version of underlying neuronal activity. More than the distorting effects of
hemodynamic processes on the temporal structure of fMRI signals per se, it is the dif-
ference in hemodynamics in different parts of the brain that forms a severe confound for
dynamic brain connectivity models. Particularly, the delay imposed upon fMRI signals
with respect to the underlying neural activity is known to vary between subjects and
between different brain regions of the same subject (Aguirre et al., 1998; Saad et al.,
2001). Second, although CBF, CBV and deoxy-Hb changes range in the tens of per-
cent, the BOLD signal change at 1.5T or 3T is in the range of 0.5-2%. Nevertheless,
1. Note that we have reparametrized the equation here in terms of τ_f² to make τ_f a proper time constant
in units of seconds.
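The hemodynamic behavior described above can be reproduced numerically. The following Python sketch Euler-integrates equations (2)-(5) and evaluates (1) for a brief neuronal input pulse; the parameter values (τ0, α, E0, V0, τ_s, τ_f and the calibration constants k1, k2, k3) are typical literature choices assumed for illustration, not values fixed by this chapter.

```python
import numpy as np

# Illustrative Euler integration of the hemodynamic model, eqs. (2)-(5),
# and the BOLD signal equation (1). Parameter values are common literature
# choices (assumed), roughly appropriate for 1.5T GRE-BOLD.
tau0, alpha, E0, V0 = 1.0, 0.32, 0.34, 0.02
tau_s, tau_f = 1.54, 2.46
k1, k2, k3 = 7.0 * E0, 2.0, 2.0 * E0 - 0.2

dt, T = 0.01, 25.0
n = int(T / dt)
s, f, v, q = 0.0, 1.0, 1.0, 1.0          # baseline state
bold = np.zeros(n)
for i in range(n):
    t = i * dt
    z = 1.0 if t < 1.0 else 0.0          # brief neuronal input pulse
    ds = z - s / tau_s - (f - 1.0) / tau_f**2          # eq. (4)
    df = s                                             # eq. (5)
    dv = (f - v**(1.0 / alpha)) / tau0                 # eq. (2)
    dq = (f * (1.0 - (1.0 - E0)**(1.0 / f)) / E0
          - q / v**(1.0 - 1.0 / alpha)) / tau0         # eq. (3)
    s, f, v, q = s + dt * ds, f + dt * df, v + dt * dv, q + dt * dq
    bold[i] = V0 * (k1 * (1 - q) + k2 * (1 - q / v) + k3 * (1 - v))  # eq. (1)

peak_time = np.argmax(bold) * dt
print(f"peak BOLD change {100 * bold.max():.2f}% at t = {peak_time:.1f} s")
```

Integrated this way, the response shows the sluggish, delayed character discussed in the text: a peak of a few percent several seconds after the one-second input.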
at lower field strengths. The cost of this greater specificity and higher effective spatial
resolution is that SE-BOLD has a lower intrinsic SNR than GRE-BOLD. The balloon
model equations above are specific to GRE-BOLD at 1.5T and 3T and have been ex-
tended to reflect diffusion effects for higher field strengths (Uludag et al., 2009).
In summary, fMRI is an indirect measure of neuronal and synaptic activity. The
physiological quantities directly determining signal contrast in BOLD fMRI are hemo-
dynamic quantities such as cerebral blood flow and volume and oxygen metabolism.
fMRI can achieve an excellent spatial resolution (millimeters down to hundreds of mi-
crometers at high field strength) with good temporal resolution (seconds down to hun-
dreds of milliseconds). The potential to resolve neuronal population interactions at a
high spatial resolution is what drives attempts at causal time series modeling of fMRI
data. However, the significant aspects of fMRI that pose challenges for such attempts
are i) the enormous dimensionality of the data, which contains hundreds of thousands of
channels (voxels), ii) the temporal convolution of neuronal events by sluggish hemody-
namics that can differ between remote parts of the brain, and iii) the relatively sparse
temporal sampling of the signal.
In fact, Granger distinguished true causal relations – only to be inferred in the pres-
ence of knowledge of the state of the whole universe – from “prima facie” causal rela-
tions that we refer to as “influence” in agreement with other authors (Commenges and
Gegout-Petit, 2009). Almost simultaneously with Granger's work, Akaike (Akaike, 1968)
and Schweder (Schweder, 1970) introduced similar concepts of influence, prompting
Valdes-Sosa et al. (in press) to coin the term WAGS influence (for Wiener-Akaike-
Granger-Schweder). This is a generalization of a proposal by Aalen (Aalen, 1987;
Aalen and Frigessi, 2007) who was among the first to point out the connections be-
tween Granger’s and Schweder’s influence concepts. Within this framework we can
define several general types of WAGS influence, which are applicable to both Marko-
vian and non-Markovian processes, in discrete or continuous time.
For three vector time series X1 (t) , X2 (t) , X3 (t) we wish to know if time series X1 (t)
is influenced by time series X2 (t) conditional on X3 (t). Here X3 (t) can be considered
any set of relevant time series to be controlled for. Let X [a, b] = {X (t) , t ∈ [a, b]} denote
the history of a time series in the discrete or continuous time interval [a, b]. The first
categorical distinction is based on what part of the present or future of X1 (t) can be pre-
dicted by the past or present of X2 (τ2 ), τ2 ≤ t. This leads to the following classification
(Florens, 2003; Florens and Fougere, 1996):
1. If X2 (τ2 ), τ2 < t, can influence any future value of X1 (t), it is a global influence. Strong, conditional, global independence holds when:

P\left( X_1(\infty, t] \mid X_1(t, -\infty], X_2(t, -\infty], X_3(t, -\infty] \right) = P\left( X_1(\infty, t] \mid X_1(t, -\infty], X_3(t, -\infty] \right) \quad (6)
That is: the probability distribution of the future values of X1 does not depend on the
past of X2 , given that the influence of the past of both X1 and X3 has been taken into
account. When this condition does not hold we say X2 (t) strongly, conditionally, and
globally influences (SCGi) X1 (t) given X3 (t). Here we use a convention for inter-
vals [a,b) which indicates that the left endpoint is included but not the right and that
b precedes a. Note that the whole future of X1 (t) is included (hence the term “global”)
and the whole past of all time series is considered. This means these definitions ac-
commodate non-Markovian processes (for Markovian processes, we only consider the
previous time point). Furthermore, these definitions do not depend on an assumption
of linearity or any given functional relationship between time series.

Types of WAGS influence. Each type is defined by the absence of the corresponding strong or weak, conditional, global/local/contemporaneous independence:

Type of influence            Strong (probability distribution)    Weak (expectation)
Global (all horizons)        X2 (t) SCGi X1 (t) || X3 (t)         X2 (t) WCGi X1 (t) || X3 (t)
Local (immediate future)     X2 (t) SCLi X1 (t) || X3 (t)         X2 (t) WCLi X1 (t) || X3 (t)
Contemporaneous              X2 (t) SCCi X1 (t) || X3 (t)         X2 (t) WCCi X1 (t) || X3 (t)

Note also that this definition is appropriate for point processes, discrete and continuous time series,
even for categorical (qualitative valued) time series. The only problem with this for-
mulation is that it calls on the whole probability distribution and therefore its practical
assessment requires the use of measures such as mutual information that estimate the
probability densities nonparametrically.
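As a rough illustration of such a nonparametric assessment, the Python sketch below computes a crude histogram plug-in estimate of the conditional mutual information between the future of one series and the past of another, given the first series' own past. The coupled toy system and all constants are illustrative assumptions, not taken from this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def cond_mi(x, y, z, bins=8):
    """Crude histogram plug-in estimate of I(X; Y | Z) in nats."""
    def disc(a):  # quantile binning into `bins` levels
        edges = np.quantile(a, np.linspace(0, 1, bins + 1)[1:-1])
        return np.digitize(a, edges)
    xd, yd, zd = disc(x), disc(y), disc(z)
    p_xyz, _ = np.histogramdd(np.c_[xd, yd, zd], bins=(bins, bins, bins))
    p_xyz /= p_xyz.sum()
    p_xz = p_xyz.sum(axis=1)          # marginal over y
    p_yz = p_xyz.sum(axis=0)          # marginal over x
    p_z = p_xz.sum(axis=0)            # marginal over x and y
    mi = 0.0
    for i in range(bins):
        for j in range(bins):
            for k in range(bins):
                if p_xyz[i, j, k] > 0:
                    mi += p_xyz[i, j, k] * np.log(
                        p_xyz[i, j, k] * p_z[k] / (p_xz[i, k] * p_yz[j, k]))
    return mi

# Toy system: the past of X2 drives the future of X1, not vice versa.
n = 20000
x2 = rng.standard_normal(n)
x1 = np.zeros(n)
for t in range(n - 1):
    x1[t + 1] = 0.8 * x1[t] + 0.6 * x2[t] + 0.5 * rng.standard_normal()

mi_fwd = cond_mi(x1[1:], x2[:-1], x1[:-1])   # past of X2 about future of X1
mi_rev = cond_mi(x2[1:], x1[:-1], x2[:-1])   # past of X1 about future of X2
print(mi_fwd, mi_rev)
```

The forward estimate comes out clearly positive while the reverse stays near the small positive bias of the plug-in estimator, illustrating the asymmetry that the strong definitions are meant to capture.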
As an alternative, weak concepts of influence can be defined based on expectations.
Consider weak conditional local independence in discrete time, which is defined:
E\left[ X_1[t + \Delta t] \mid X_1[t, -\infty], X_2[t, -\infty], X_3[t, -\infty] \right] = E\left[ X_1[t + \Delta t] \mid X_1[t, -\infty], X_3[t, -\infty] \right] \quad (7)
When this condition does not hold we say X2 weakly, conditionally and locally in-
fluences (WCLi) X1 given X3 . To make the implementation of this definition insightful,
consider a discrete first-order vector auto-regressive (VAR) model for X = [X1 X2 X3 ]:

X[t + \Delta t] = A\, X[t] + e[t + \Delta t]

For this case E [ X[t + ∆t]| X[t, −∞]] = AX [t], and analyzing influence reduces to find-
ing which of the autoregressive coefficients are zero. Thus, many proposed operational
tests of WAGS influence, particularly in fMRI analysis, have been formulated as tests
of discrete autoregressive coefficients, although not always of order 1. Within the same
model one can operationalize weak conditional instantaneous independence in dis-
crete time as zero off-diagonal entries in the co-variance matrix of the innovations e[t]:
\Sigma_e = \mathrm{cov}\left[ X[t + \Delta t] \mid X[t, -\infty] \right] = E\left[ X[t + \Delta t]\, X'[t + \Delta t] \mid X[t, -\infty] \right]
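These two operational tests can be sketched numerically. In the Python example below (a toy simulation; the ground-truth coefficients are illustrative choices, not fMRI estimates), WCLi corresponds to nonzero off-diagonal entries of the fitted VAR(1) coefficient matrix, and instantaneous influence to off-diagonal entries of the innovation covariance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 3-channel VAR(1) ground truth: X2 -> X1 and X3 -> X2 only,
# no direct X3 -> X1 link and independent innovations.
A_true = np.array([[0.5, 0.4, 0.0],
                   [0.0, 0.5, 0.3],
                   [0.0, 0.0, 0.5]])
n = 5000
X = np.zeros((n, 3))
for t in range(n - 1):
    X[t + 1] = A_true @ X[t] + 0.5 * rng.standard_normal(3)

# Least-squares fit of X[t+1] = A_hat @ X[t] + e[t+1]
A_hat = np.linalg.lstsq(X[:-1], X[1:], rcond=None)[0].T
resid = X[1:] - X[:-1] @ A_hat.T
Sigma_e = np.cov(resid.T)
print(np.round(A_hat, 2))
print(np.round(Sigma_e, 3))
```

A clearly nonzero A_hat[0, 1] indicates that X2 WCLi-influences X1 given X3, while A_hat[0, 2] and the off-diagonal innovation covariances stay near zero, matching the restrictions built into the simulation.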
The analogous definition of weak conditional local independence in continuous time, for a process Y, is:

E\left[ Y_1[t] \mid Y_1(t, -\infty], Y_2(t, -\infty], Y_3(t, -\infty] \right] = E\left[ Y_1[t] \mid Y_1(t, -\infty], Y_3(t, -\infty] \right] \quad (9)
Now consider a first-order stochastic differential equation (SDE) model for Y = [Y1 Y2 Y3 ]:
dY = BYdt + dω (10)
Then, since ω is a Wiener process with zero-mean white Gaussian noise as a derivative,
E [ Y[t]| Y(t, −∞]] = B Y (t) and analysing influence amounts to estimating the parameters
B of the SDE. However, if one were to observe a discretely sampled version X[k] =
Y (k∆t) at sampling interval ∆t and model this with the discrete autoregressive model
above, this would be inadequate to estimate the SDE parameters for large ∆t, since the
exact relations between continuous and discrete system matrices are known to be:
A = e^{B \Delta t} = I + \sum_{i=1}^{\infty} \frac{\Delta t^i}{i!} B^i

\Sigma_e = \int_0^{\Delta t} e^{B s}\, \Sigma_\omega\, e^{B' s}\, ds \quad (11)
The power series expansion of the matrix exponential in the first line shows A to be
a weighted sum of successive matrix powers Bi of the continuous time system matrix.
Thus, A will contain contributions from direct (in B) and indirect (in i steps, in B^i)
causal links between the modeled areas. The contribution of the more indirect links is
progressively down-weighted with the number of causal steps from one area to another
and is smaller when the sampling interval ∆t is smaller. This makes clear that multivari-
ate discrete signal models have some undesirable properties for coarsely sampled sig-
nals (i.e. a large ∆t with respect to the system dynamics), such as fMRI data. Critically,
entirely ruling out indirect influences is not actually achieved merely by employing a
multivariate discrete model. Furthermore, estimated WAGS influence (particularly the
relative contribution of indirect links) is dependent on the employed sampling inter-
val. However, the discrete system matrix still represents the presence and direction of
influence, possibly mediated through other regions.
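The down-weighting of indirect links and its dependence on the sampling interval can be checked numerically. The Python sketch below evaluates the power series of eq. (11) for an illustrative chain-structured continuous system matrix B (region 3 drives 2, which drives 1, with no direct 3-to-1 link) and shows the indirect entry of A growing with ∆t.

```python
import numpy as np

# Illustrative chain: B[0,1] and B[1,2] nonzero, B[0,2] = 0 (no direct link).
B = np.array([[-1.0, 0.8, 0.0],
              [0.0, -1.0, 0.8],
              [0.0, 0.0, -1.0]])

def expm_series(M, terms=30):
    """Matrix exponential via the truncated power series of eq. (11)."""
    out, P = np.eye(M.shape[0]), np.eye(M.shape[0])
    for i in range(1, terms):
        P = P @ M / i
        out = out + P
    return out

results = {}
for dt in (0.1, 0.5, 2.0):
    A = expm_series(B * dt)
    results[dt] = (A[0, 1], A[0, 2])
    print(f"dt={dt}: direct A[0,1]={A[0, 1]:.3f}, indirect A[0,2]={A[0, 2]:.3f}")
```

For this upper-triangular B the entries can be checked in closed form (A[0,2] is proportional to (∆t)² for small ∆t): the indirect entry is negligible at fine sampling but substantial at coarse sampling, exactly the aggregation effect discussed above.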
When the goal is to estimate WAGS influence for discrete data starting from a con-
tinuous time model, one has to model explicitly the mapping to discrete time. Mapping
continuous time predictions to discrete samples is a well known topic in engineering
and can be solved by explicit integration over discrete time steps as performed in (11)
above. Although this defines the mapping from continuous to discrete parameters, it
does not solve the reverse assignment of estimating continuous model parameters from
discrete data. Doing so requires a solution to the aliasing problem (McCrorie, 2003) in
continuous stochastic system identification by setting sufficient conditions on the ma-
trix logarithm function to make B above identifiable (uniquely defined) in terms of A.
Interesting in this regard is a line of work initiated by Bergstrom (Bergstrom, 1966,
1984) and Phillips (Phillips, 1973, 1974) studying the estimation of continuous time
Autoregressive models (McCrorie, 2002), and continuous time Autoregressive Moving
Average Models (Chambers and Thornton, 2009) from discrete data. This work rests
on the observation that the lag zero covariance matrix Σe will show contemporaneous
covariance even if the continuous covariance matrix Σω is diagonal. In other words,
the discrete noise becomes correlated over the discrete time-series because the random
fluctuations are aggregated over time. Rather than considering this a disadvantage, this
approach tries to use both lag information (the AR part) and zero-lag covariance infor-
mation to identify the underlying continuous linear model.
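A minimal sketch of the inverse mapping, under the assumption that A has distinct positive real eigenvalues so that the principal matrix logarithm is unique and the aliasing problem does not arise; the 2-by-2 system matrix and sampling interval below are illustrative choices.

```python
import numpy as np

# Illustrative continuous system matrix with distinct real eigenvalues.
dt = 0.5
B_true = np.array([[-1.0, 0.6],
                   [0.2, -1.5]])

# Forward map: A = exp(B * dt), via eigendecomposition.
w, V = np.linalg.eig(B_true * dt)
A = (V @ np.diag(np.exp(w)) @ np.linalg.inv(V)).real

# Inverse map: B = log(A) / dt. Because A's eigenvalues are positive and
# distinct, the principal logarithm is well defined and recovers B exactly.
wa, Va = np.linalg.eig(A)
B_rec = (Va @ np.diag(np.log(wa)) @ np.linalg.inv(Va)).real / dt
print(np.round(B_rec, 3))
```

When eigenvalues of A are complex or negative the logarithm is no longer unique, which is precisely the identifiability problem that the conditions discussed above are designed to resolve.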
Notwithstanding the desirability of a continuous time model for consistent infer-
ence on WAGS influence, there are a few invariances of discrete VAR models, or more
generally discrete Vector Autoregressive Moving Average (VARMA) models that allow
their carefully qualified usage in estimating causal influence. The VAR formulation of
WAGS influence has the property of invariance under invertible linear filtering. More
precisely, a general measure of influence remains unchanged if channels are each pre-
multiplied with different invertible lag operators (Geweke, 1982). However, in practice
the order of the estimated VAR model would need to be sufficient to accommodate
these operators. Beyond invertible linear filtering, a VARMA formulation has further
invariances. Solo (2006) showed that causality in a VARMA model is preserved under
sampling and additive noise. More precisely, if both local and contemporaneous influ-
ence is considered (as defined above) the VARMA measure is preserved under sampling
and under the addition of independent but colored noise to the different channels. Fi-
nally, Amendola et al. (2010) show the class of VARMA models to be closed under
aggregation operations, which include both sampling and time-window averaging.
v (t). In fMRI experiments the exogenous inputs v (t) mostly reflect experimental control
and often have the form of a simple indicator function that can represent, for instance,
the presence or absence of a visual stimulus. The vector-functions f and g can generally
be non-linear.
The state-space formalism allows representation of a very large class of stochastic
processes. Specifically, it allows representation of both so-called ‘black-box’ models,
in which parameters are treated as means to adjust the fit to the data without reflecting
physically meaningful quantities, and ‘grey-box’ models, in which the adjustable pa-
rameters do have a physical or physiological (in the case of the brain) interpretation. A
prominent example of a black-box model in econometric time-series analysis and sys-
tems identification is the (discrete) Vector Autoregressive Moving Average model with
exogenous inputs (VARMAX model) defined as (Ljung, 1999; Reinsel, 1997):

F(B)\, x_t = G(B)\, v_t + L(B)\, e_t \quad (13)

Here, the backshift operator B is defined, for any ηt , as B^i ηt = ηt−i , and F, G and L
are polynomials in the backshift operator, such that e.g. F(B) = \sum_{i=0}^{p} F_i B^i , and p, s
and q are the dynamic orders of the VARMAX(p,s,q) model. The minimal constraints
on (13) to make it identifiable are F0 = L0 = I, which yields the standard VARMAX
representation. The VARMAX model and its various reductions (by use of only one
or two of the polynomials, e.g. VAR, VARX or VARMA models) have played a large
role in time-series prediction and WAGS influence modeling. Thus, in the context of
state space models it is important to consider that the VARMAX model form can be
equivalently formulated in a discrete linear state space form:

x_{k+1} = A\, x_k + B\, v_k + w_k, \qquad y_k = C\, x_k + D\, v_k + e_k
Again, the exact relations between the discrete and continuous state space parameter
matrices can be derived analytically by explicit integration over time (Ljung, 1999).
And, as discussed above, wherever discrete data is used to model continuous influence
relations the problems of temporal aggregation and aliasing have to be taken into ac-
count.
Although analytic solutions for the discretely sampled continuous linear systems
exist, the discretization of the nonlinear stochastic model (12) does not have a unique
global solution. However, physiological models of neuronal population dynamics and
hemodynamics are formulated in continuous time and are mostly nonlinear while fMRI
data is inherently discrete with low sampling frequencies. Therefore, it is the discretiza-
tion of the nonlinear dynamical stochastic models that is especially relevant to causal
analysis of fMRI data. A local linearization approach was proposed by Ozaki (1992)
as a bridge between discrete time series models and nonlinear continuous dynamical
systems models. Considering the nonlinear state equation without exogenous input:

dX = f(X)\, dt + d\omega

The essential assumption in local linearization (LL) of this nonlinear system is to con-
sider the Jacobian matrix J_{lm} = \partial f_l(X) / \partial X_m as constant over the time period [t, t + \Delta t]. This
Jacobian plays the same role as the autoregressive matrix in the linear systems above.
Integration over this interval gives the solution:

X_{t + \Delta t} = X_t + J^{-1}\left( e^{J \Delta t} - I \right) f(X_t) + e_{t + \Delta t}
where I is the identity matrix. Note integration should not be computed this way since
it is numerically unstable, especially when the Jacobian is poorly conditioned. A list
of robust and fast procedures is reviewed in (Valdes-Sosa et al., 2009). This solution is
locally linear but crucially it changes with the state at the beginning of each integration
interval; this is how it accommodates nonlinearity (i.e., a state-dependent autoregres-
sion matrix). As above, the discretized noise shows instantaneous correlations due to
the aggregation of ongoing dynamics within the span of a sampling period. Once again,
this highlights the underlying mechanism for problems with temporal sub-sampling and
aggregation for some discrete time models of WAGS influence.
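One numerically safer route evaluates the integrated term through an augmented matrix exponential instead of forming J⁻¹(e^{J∆t} − I) directly. The Python sketch below implements one LL integration step this way for an illustrative two-dimensional deterministic test system (a damped pendulum); the system, step size, and horizon are all assumptions for illustration.

```python
import numpy as np

def f(x):
    """Toy drift: damped pendulum (illustrative nonlinear system)."""
    return np.array([x[1], -np.sin(x[0]) - 0.2 * x[1]])

def jacobian(x):
    return np.array([[0.0, 1.0],
                     [-np.cos(x[0]), -0.2]])

def expm_series(M, terms=40):
    out, P = np.eye(M.shape[0]), np.eye(M.shape[0])
    for i in range(1, terms):
        P = P @ M / i
        out = out + P
    return out

def ll_step(x, dt):
    """One local-linearization step: x + int_0^dt exp(J s) ds @ f(x),
    computed via the exponential of an augmented block matrix, which
    avoids inverting a possibly ill-conditioned Jacobian."""
    n = x.size
    J = jacobian(x)
    M = np.zeros((n + 1, n + 1))
    M[:n, :n] = J * dt
    M[:n, n] = f(x) * dt
    return x + expm_series(M)[:n, n]

x = np.array([1.0, 0.0])
for _ in range(100):          # integrate 10 seconds with dt = 0.1
    x = ll_step(x, 0.1)
print(np.round(x, 3))
```

Because the Jacobian is re-evaluated at the start of every step, the autoregression-like matrix is state-dependent, which is exactly how LL accommodates nonlinearity; for the damped system above the trajectory decays toward the origin as expected.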
estimate the trajectories of hidden neuronal processes from observed neuroimaging data
if one can formulate an accurate model of the processes leading from neuronal activity
to data records. A few years later, this idea was robustly transferred to fMRI data in the
form of DCM (Friston et al., 2003). DCM combines three ideas about causal influence
analysis in fMRI data (or neuroimaging data in general), which can be understood in
terms of the discussion of the fMRI signal and state space models above (Daunizeau
et al., 2009a).
First, neuronal interactions are best modeled at the level of unobserved (latent)
signals, instead of at the level of observed BOLD signals. This requires a state space
model with a dynamic model of neuronal population dynamics and interactions. The
original model that was formulated for the dynamics of neuronal states x = {x1 , . . . , xN }
is a bilinear ODE model:
\dot{x} = A x + \sum_j v_j B^j x + C v \quad (18)
That is, the noiseless neuronal dynamics are characterized by a linear term (with entries
in A representing intrinsic coupling between populations), an exogenous term (with C
representing driving influence of experimental variables) and a bilinear term (with B^j
representing the modulatory influence of experimental variables on coupling between
populations). More recent work has extended this model, e.g. by adding a quadratic
term (Stephan et al., 2008), stochastic dynamics (Daunizeau et al., 2009b) or multiple
state variables per region (Marreiros et al., 2008).
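A toy numerical illustration of eq. (18): two regions, a driving input to region 1 and a modulatory input that gates the coupling from region 1 to region 2. All matrices and values are illustrative assumptions, not parameters from any fitted DCM.

```python
import numpy as np

# Illustrative bilinear neuronal model, eq. (18), with two regions.
A = np.array([[-1.0, 0.0],
              [0.0, -1.0]])           # intrinsic coupling (self-decay only)
B2 = np.array([[0.0, 0.0],
               [0.8, 0.0]])           # modulatory effect of v2 on the 1 -> 2 link
C = np.array([[1.0, 0.0],
              [0.0, 0.0]])            # driving effect of v1 on region 1

def simulate(modulated, dt=0.01, T=10.0):
    """Euler-integrate eq. (18); return the peak activity of region 2."""
    x = np.zeros(2)
    peak = 0.0
    for i in range(int(T / dt)):
        t = i * dt
        v = np.array([1.0 if t < 2.0 else 0.0,    # v1: driving stimulus
                      1.0 if modulated else 0.0])  # v2: modulatory context
        dx = A @ x + v[1] * (B2 @ x) + C @ v       # eq. (18)
        x = x + dt * dx
        peak = max(peak, x[1])
    return peak

peak_on, peak_off = simulate(True), simulate(False)
print(peak_on, peak_off)
```

Region 2 responds only when the modulatory input is on, illustrating how the bilinear term lets experimental context switch effective connectivity on and off.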
Second, the latent neuronal dynamics are related to observed data by a generative
(forward) model that accounts for the temporal convolution of neuronal events by slow
and variably delayed hemodynamics. This generative forward model in DCM for fMRI
is exactly the (simplified) balloon model set out in section 2. Thus, for every selected
region a single state variable represents the neuronal or synaptic activity of a local pop-
ulation of neurons and (in DCM for BOLD fMRI) four or five more represent hemo-
dynamic quantities such as capillary blood volume, blood flow and deoxy-hemoglobin
content. All state variables (and the equations governing their dynamics) that serve
the mapping of neuronal activity to the fMRI measurements (including the observation
equation) can be called the observation model. Most of the physiologically motivated
generative model in DCM for fMRI is therefore concerned with an observation model
encapsulating hemodynamics. The parameters in this model are estimated conjointly
with the parameters quantifying neuronal connectivity. Thus, the forward biophysical
model of hemodynamics is ‘inverted’ in the estimation procedure to achieve a deconvo-
lution of fMRI time series and obtain estimates of the underlying neuronal states. DCM
has also been applied to EEG/MEG, in which case the observation model encapsulates
the lead-field matrix from neuronal sources to EEG electrodes or MEG sensors (Kiebel
et al., 2009).
Third, the approach to estimating the hidden state trajectories (i.e. filtering and
smoothing) and parameter values in DCM is cast in a Bayesian framework. In short,
Bayes’ theorem is used to combine priors p(Θ|M) and likelihood p (y|Θ, M) into the model evidence:

p(y \mid M) = \int p(y \mid \Theta, M)\, p(\Theta \mid M)\, d\Theta \quad (19)
Here, the model M is understood to define the priors on all parameters and the like-
lihood through the generative models for neuronal dynamics and hemodynamics. A
posterior for the parameters p (Θ|y, M) can be obtained as the distribution over param-
eters which maximizes the evidence (19). Since this optimization problem has no ana-
lytic solution and is intractable with numerical sampling schemes for complex models,
such as DCM, approximations must be used. The inference approach for DCM relies
on variational Bayes methods (Beal, 2003) that optimize an approximation density q(Θ)
to the posterior. The approximation density is taken to have a Gaussian form, which is
often referred to as the “Laplace approximation” (Friston et al., 2007). In addition to
the approximate posterior on the parameters, the variational inference also results
in a lower bound on the evidence, sometimes referred to as the “free energy”. This
lower bound (or other approximations to the evidence, such as the Akaike Informa-
tion Criterion or the Bayesian Information Criterion) is used for model comparison
(Penny et al., 2004). Importantly, these quantities explicitly balance goodness-of-fit
against model complexity as a means of avoiding overfitting.
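As a minimal illustration of complexity-penalized model comparison (using BIC in place of the variational free energy), the Python sketch below scores VAR(1) models with different zero-restrictions on synthetic data. The data-generating matrix, the candidate masks, and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 2-channel VAR(1): channel 2 drives channel 1, not vice versa.
n = 400
A_true = np.array([[0.6, 0.3], [0.0, 0.6]])
X = np.zeros((n, 2))
for t in range(n - 1):
    X[t + 1] = A_true @ X[t] + 0.5 * rng.standard_normal(2)

def bic_var(X, mask):
    """BIC of a VAR(1) whose coefficients are restricted by a boolean mask,
    assuming Gaussian innovations with per-channel ML variance."""
    Y, Z = X[1:], X[:-1]
    A_hat = np.zeros((2, 2))
    for r in range(2):
        cols = np.flatnonzero(mask[r])
        if cols.size:
            A_hat[r, cols] = np.linalg.lstsq(Z[:, cols], Y[:, r], rcond=None)[0]
    resid = Y - Z @ A_hat.T
    k = int(mask.sum())                      # number of free coefficients
    ll = sum(-0.5 * len(Y) * (np.log(2 * np.pi * resid[:, r].var()) + 1)
             for r in range(2))
    return -2 * ll + k * np.log(len(Y))      # fit term + complexity penalty

full = np.ones((2, 2), bool)                       # all 4 coefficients free
true_mask = np.array([[True, True], [False, True]])  # matches the generator
diag = np.eye(2, dtype=bool)                       # no cross-coupling at all
scores = {'full': bic_var(X, full), 'true': bic_var(X, true_mask),
          'diag': bic_var(X, diag)}
print(scores)
```

The over-restricted diagonal model is penalized by its poor fit, while the full model pays only a small complexity penalty; on most draws the mask matching the generator attains the lowest score, which is the fit-versus-complexity trade-off described above.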
An important limiting aspect of DCM for fMRI is that the models M that are com-
pared also (implicitly) contain an anatomical model or structural model that contains i)
a selection of the ROIs in the brain that are assumed to be of importance in the cognitive
process or task under investigation, ii) the possible interactions between those structures
and iii) the possible effects of exogenous inputs onto the network. In other words, each
model M specifies the nodes and edges in a directed (possibly cyclic) structural graph
model. Since the anatomical model also determines the selected part y of the total
dataset (all voxels) one cannot use the evidence to compare different anatomical mod-
els. This is because the evidence of different anatomical models is defined over different
data. Applications of DCM to date invariably use very simple anatomical models (typ-
ically employing 3-6 ROIs) in combination with its complex parameter-rich dynamical
model discussed above. The clear danger with overly simple anatomical models is that
of spurious influence: an erroneous influence found between two selected regions that
in reality is due to interactions with additional regions which have been ignored. Pro-
totypical examples of spurious influence, of relevance in brain connectivity, are those
between unconnected structures A and B that receive common input from, or are connected
through an intervening, unmodeled region C.
that estimation of mathematical models from time-series data generally has two im-
portant aspects: model selection and model identification (Ljung, 1999). In the model
selection stage a class of models is chosen by the researcher that is deemed suitable
for the problem at hand. In the model identification stage the parameters in the chosen
model class are estimated from the observed data record. In practice, model selection
and identification often occur in a somewhat interactive fashion where, for instance,
model selection can be informed by the fit of different models to the data achieved in
an identification step. The important point is that model selection involves a mixture
of choices and assumptions on the part of the researcher and the information gained
from the data-record itself. These considerations indicate that an important distinction
must be made between exploratory and confirmatory approaches, especially in struc-
tural model selection procedures for brain connectivity. Exploratory techniques use
information in the data to investigate the relative applicability of many models. As
such, they have the potential to detect ‘missing’ regions in structural models. Confir-
matory approaches, such as DCM, test hypotheses about connectivity within a set of
models assumed to be applicable. Sources of common input or intervening causes are
taken into account in a multivariate confirmatory model, but only if the employed struc-
tural model allows it (i.e. if the common input or intervening node is incorporated in
the model).
The technique of Granger Causality Mapping (GCM) was developed to explore
all regions in the brain that interact with a single selected reference region using au-
toregressive modeling of fMRI time-series (Roebroeck et al., 2005). By employing a
simple bivariate model containing the reference region and, in turn, every other voxel in
the brain, the sources and targets of influence for the reference region can be mapped.
It was shown that such an ‘exploratory’ mapping approach can form an important tool
in structural model selection. Although a bivariate model does not discern direct from
indirect influences, the mapping approach locates potential sources of common input
and areas that could act as intervening network nodes. In addition, by settling for
a bivariate model one trivially avoids the conflation of direct and indirect influences
that can arise in discrete AR models due to temporal aggregation, as discussed above.
Other applications of autoregressive modeling to fMRI data have considered full mul-
tivariate models on large sets of selected brain regions, illustrating the possibility to
estimate high-dimensional dynamical models. For instance, Valdes-Sosa (2004) and
Valdes-Sosa et al. (2005b) applied these models to parcellations of the entire cortex in
conjunction with sparse regression approaches that enforce an implicit structural model
selection within the set of parcels. In another more recent example (Deshpande et al.,
2008) a full multivariate model was estimated over 25 ROIs (that were found to be ac-
tivated in the investigated task) together with an explicit reduction procedure to prune
regions from the full model as a structural model selection procedure. Additional vari-
ants of VAR-model-based causal inference that have been applied to fMRI include time-
varying influence (Havlicek et al., 2010), blockwise (or ‘cluster-wise’) influence from
one group of variables to another (Barrett et al., 2010; Sato et al., 2010) and frequency-
decomposed influence (Sato et al., 2009).
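The bivariate mapping idea can be sketched as follows: for each voxel time series, compare the residual variance of an autoregressive prediction with and without the reference region's past. The synthetic data and couplings below are illustrative (real GCM operates voxel-wise on fMRI volumes).

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic reference region and a handful of 'voxels'; only voxel 0
# receives a lagged influence from the reference (illustrative coupling).
n, n_vox = 2000, 5
ref = np.zeros(n)
vox = 0.5 * rng.standard_normal((n, n_vox))
for t in range(n - 1):
    ref[t + 1] = 0.5 * ref[t] + 0.5 * rng.standard_normal()
    vox[t + 1] = 0.5 * vox[t] + 0.5 * rng.standard_normal(n_vox)
    vox[t + 1, 0] += 0.4 * ref[t]

def gc_stat(target, source):
    """Log variance ratio of restricted (own past only) vs. full
    (own past + source past) one-lag regressions: a simple
    Granger-style influence statistic."""
    y, yp, sp = target[1:], target[:-1], source[:-1]
    full = np.c_[yp, sp]
    r_full = y - full @ np.linalg.lstsq(full, y, rcond=None)[0]
    r_restr = y - yp * np.linalg.lstsq(yp[:, None], y, rcond=None)[0]
    return np.log(r_restr.var() / r_full.var())

stats = [gc_stat(vox[:, j], ref) for j in range(n_vox)]
print(np.round(stats, 4))
```

Mapping this statistic over all voxels (and in both directions) produces the sources and targets of influence for the reference region; here only the coupled voxel yields a clearly positive value.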
charges (SWDs) spread through the brain. fMRI was used to map the hemodynamic
response throughout the brain to seizure activity, where ictal and interictal states were
quantified by the simultaneously recorded EEG. Three structures were selected by the
authors as the crucial nodes in the network that generates and sustains seizure activ-
ity and further analysed with i) DCM, ii) simple AR modeling of the fMRI signal and
iii) AR modeling applied to neuronal state-variable estimates obtained with a hemo-
dynamic deconvolution step. By applying G-causality analysis to deconvolved fMRI
time-series, the stochastic dynamics of the linear state-space model are augmented with
the complex biophysically motivated observation model in DCM. This step is crucial
if the goal is to compare the dynamic connectivity models and draw conclusions on
the relative merits of linear stochastic models (explicitly estimating WAGS influence)
and bilinear deterministic models. The results showed both AR analysis after decon-
volution and DCM analysis to be in accordance with the gold-standard iEEG analyses,
identifying the most pertinent influence relations undisturbed by variations in HRF la-
tencies. In contrast, the final result of simple AR modeling of the fMRI signal showed
less correspondence with the gold standard, due to the confounding effects of different
hemodynamic latencies which are not accounted for in the model.
Two important lessons can be drawn from David et al.’s study and the ensuing
discussions (Bressler and Seth, 2010; Daunizeau et al., 2009a; David, 2009; Friston,
2009b,a; Roebroeck et al., 2009a,b). First, it confirms again the distorting effects of
hemodynamic processes on the temporal structure of fMRI signals and, more impor-
tantly, that the difference in hemodynamics in different parts of the brain can form a
confound for dynamic brain connectivity models (Roebroeck et al., 2005). Second,
state-space models that embody observation models that connect latent neuronal dy-
namics to observed fMRI signals have a potential to identify causal influence unbiased
by this confound. As a consequence, substantial recent methodological work has aimed
at combining different models of latent neuronal dynamics with a form of a hemody-
namic observation model in order to provide an inversion or filtering algorithm for esti-
mation of parameters and hidden state trajectories. Following the original formulation
of DCM that provides a bilinear ODE form for the hidden neuronal dynamics, attempts
have been made at explicit integration of hemodynamics convolution with stochastic
dynamic models that are interpretable in the framework of WAGS influence.
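As a toy illustration of the generic recipe (not any particular published model; all constants invented), the forward problem composes latent stochastic neuronal dynamics — here a two-region linear VAR(1) — with a region-specific hemodynamic convolution as the observation model plus measurement noise. Inverting exactly this kind of generative model is what the filtering algorithms discussed below accomplish.

```python
import numpy as np

rng = np.random.default_rng(1)
n, L = 500, 16

# Latent neuronal dynamics: two-region stochastic VAR(1), region 0 -> region 1.
A = np.array([[0.6, 0.0],
              [0.4, 0.6]])
x = np.zeros((n, 2))
for t in range(1, n):
    x[t] = A @ x[t-1] + 0.1 * rng.standard_normal(2)

def kernel(peak):
    # Gamma-shaped stand-in for a region-specific hemodynamic response.
    t = np.arange(L)
    h = t ** peak * np.exp(-t)
    return h / h.sum()

# Observation model: region-specific convolution plus measurement noise.
y = np.stack([np.convolve(x[:, m], kernel(d))[:n] for m, d in enumerate((4, 6))],
             axis=1)
y += 0.01 * rng.standard_normal(y.shape)
print(y.shape)  # simulated fMRI signals generated from latent dynamics
```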
For instance in (Ryali et al., 2010), following earlier work (Penny et al., 2005;
Smith et al., 2009), a discrete state-space model is proposed in which a bi-linear
vector autoregressive model quantifies dynamic neuronal state evolution, with both
intrinsic and modulatory interactions:
x_k = A x_{k−1} + Σ_{j=1}^{J} v_k^j B^j x_{k−1} + C v_k + ε_k

x̃_k^m = [ x_k^m, x_{k−1}^m, · · · , x_{k−L+1}^m ]                    (20)

y_k^m = β^m Φ x̃_k^m + e_k^m
Here, we index exogenous inputs with j and ROIs with m in superscripts. The entries
in the autoregressive matrix A, exogenous influence matrix C and bi-linear matrices
B^j have the same interpretation as in deterministic DCM. The relation between
does better than all other models on all counts. Nonetheless, the ongoing development
efforts towards improved approaches are continually extending and generalizing the
contexts in which dynamic time series models can be applied. It is clear that state space
modeling and inference on WAGS influence are fundamental concepts within this en-
deavor. We end here with some considerations of dynamic brain connectivity models
that summarize some important points and anticipate future developments.
We have emphasized that WAGS influence models of brain connectivity have largely
been aimed at data driven exploratory analysis, whereas biophysically motivated state
space models are mostly used for hypothesis-led confirmatory analysis. This is es-
pecially relevant in the interaction between model selection and model identification.
Exploratory techniques use information in the data to investigate the relative applica-
bility of many models. As such, they have the potential to detect ‘missing’ regions in
anatomical models. Confirmatory approaches test hypotheses about connectivity within
a set of models assumed to be applicable.
As mentioned above, the WAGS influence approach to statistical analysis of causal
influence that we focused on here is complemented by the interventional approach
rooted in the theory of graphical models and causal calculus. Graphical causal mod-
els have been recently applied to brain connectivity analysis of fMRI data (Ramsey
et al., 2009). Recent work combining the two approaches (White and Lu, 2010) possi-
bly leads the way to a combined causal treatment of brain imaging data incorporating
dynamic models and interventions. Such a combination could enable incorporation of
direct manipulation of brain activity by (for example) transcranial magnetic stimulation
(Pascual-Leone et al., 2000; Paus, 1999; Walsh and Cowey, 2000) into the current state
space modeling framework.
Causal models of brain connectivity are increasingly inspired by biophysical the-
ories. For fMRI this is primarily applicable in modeling the complex chain of events
separating neuronal population activity from the BOLD signal. Inversion of such a
model (in state space form) by a suitable filtering algorithm amounts to a model-based
deconvolution of the fMRI signal resulting in an estimate of latent neuronal population
activity. If the biophysical model is appropriately formulated to be identifiable (possi-
bly including priors on relevant parameters), it can take variation in the hemodynamics
between brain regions into account that can otherwise confound time series causality
analyses of fMRI signals. Although models of hemodynamics for causal fMRI anal-
ysis have reached a reasonable level of complexity, the models of neuronal dynamics
used to date have remained simple, comprising one or two state variables for an entire
cortical region or subcortical structure. Realistic dynamic models of neuronal activity
have a long history and have reached a high level of sophistication (Deco et al., 2008;
Markram, 2006). It remains an open issue to what degree complex realistic equation
systems can be embedded in analysis of fMRI – or in fact: any brain imaging modality
– and result in identifiable models of neuronal connectivity and computation.
Two recent developments create opportunities to increase complexity and realism
of neuronal dynamics models and move the level of modeling from the macroscopic
(whole brain areas) towards the mesoscopic level comprising sub-populations of areas
and cortical columns. First, the fusion of multiple imaging modalities, possibly simultaneously
recorded, has received a great deal of attention. Particularly, several attempts
at model-driven fusion of simultaneously recorded fMRI and EEG data, by inverting a
separate observation model for each modality while using the same underlying neuronal
model, have been reported (Deneux and Faugeras, 2010; Riera et al., 2007; Valdes-Sosa
et al., 2009). This approach holds great potential to fruitfully combine the superior spa-
tial resolution of fMRI with the superior temporal resolution of EEG. In (Valdes-Sosa
et al., 2009) anatomical connectivity information obtained from diffusion tensor imag-
ing and fiber tractography is also incorporated. Second, advances in MRI technology,
particularly increases of main field strength to 7T (and beyond) and advances in parallel
imaging (de Zwart et al., 2006; Heidemann et al., 2006; Pruessmann, 2004; Wiesinger
et al., 2006), greatly increase the level of spatial detail accessible with fMRI. For
instance, fMRI at 7T with sufficient spatial resolution to resolve orientation columns in
human visual cortex has been reported (Yacoub et al., 2008).
The development of state space models for causal analysis of fMRI data has moved
from discrete to continuous and from deterministic to stochastic models. Continuous
models with stochastic dynamics have desirable properties, chief among them a ro-
bust inference on causal influence interpretable in the WAGS framework, as discussed
above. However, dealing with continuous stochastic models leads to technical issues
such as the properties and interpretation of Wiener processes and Ito calculus (Friston,
2008). A number of inversion or filtering methods for continuous stochastic models
have been recently proposed, particularly for the goal of causal analysis of brain imag-
ing data, including the local linearization and innovations approach (Hernandez et al.,
1996; Riera et al., 2004), dynamic expectation maximization (Friston et al., 2008) and
generalized filtering (Friston et al., 2010). The ongoing development of these filtering
methods, their validation and their scalability towards large numbers of state variables
will be a topic of continuing research.
Acknowledgments
The authors thank Kamil Uludag for comments and discussion.
References
Odd O. Aalen. Dynamic modeling and causality. Scandinavian Actuarial Journal,
pages 177–190, 1987.
O.O. Aalen and A. Frigessi. What can statistics contribute to a causal understanding?
Scandinavian Journal of Statistics, 34:155–168, 2007.
H. Akaike. On the use of a linear model for the identification of feedback systems.
Annals of the Institute of Statistical Mathematics, 20(1):425–439, 1968.
D. Attwell and C. Iadecola. The neural basis of functional brain imaging signals. Trends
Neurosci, 25(12):621–5, 2002.
M.J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis,
University College London, 2003.
M.A. Bernstein, K.F. King, and X.J. Zhou. Handbook of MRI Pulse Sequences. Elsevier
Academic Press, Burlington, 2004.
O. David. fMRI connectivity, meaning and empiricism: comments on: Roebroeck et al.
The identification of interacting networks in the brain using fMRI: model selection,
causality and deconvolution. Neuroimage, 2009.
K. Friston. Beyond phrenology: what can neuroimaging tell us about distributed cir-
cuitry? Annu Rev Neurosci, 25:221–50, 2002.
K. Friston. Dynamic causal modeling and Granger causality: comments on: The identification
of interacting networks in the brain using fMRI: model selection, causality
and deconvolution. Neuroimage, 2009a.
Karl Friston. Hierarchical models in the brain. PLoS Computational Biology, 4, 2008.
Karl Friston. Causal modelling and brain connectivity in functional magnetic resonance
imaging. PLoS biology, 7:e33, 2009b.
Karl Friston, Klaas Stephan, Baojuan Li, and Jean Daunizeau. Generalised filtering.
Mathematical Problems in Engineering, 2010:1–35, 2010.
C. Glymour. Learning, prediction and causal Bayes nets. Trends Cogn Sci, 7(1):43–48,
2003.
E.M. Haacke, R.W. Brown, M.R. Thompson, and R. Venkatesan. Magnetic Resonance
Imaging: Physical Principles and Sequence Design. John Wiley and Sons, Inc, New
York, 1999.
J. L. Hernandez, P. A. Valdés, and P. Vila. EEG spike and wave modelled by a stochastic
limit cycle. NeuroReport, 1996.
H. Johansen-Berg and T.E.J Behrens, editors. Diffusion MRI: From quantitative mea-
surement to in-vivo neuroanatomy. Academic Press, London, 2009.
D.K. Jones, editor. Diffusion MRI: Theory, Methods, and Applications. Oxford Univer-
sity Press, Oxford, 2010.
L. Ljung. System Identification: Theory for the User. Prentice-Hall, New Jersey, 2nd
edition, 1999.
N. K. Logothetis. What we can do and what we cannot do with fMRI. Nature, 453
(7197):869–78, 2008.
H. Markram. The blue brain project. Nat Rev Neurosci, 7(2):153–60, 2006.
J. Roderick McCrorie. The likelihood of the parameters of a continuous time vector au-
toregressive model. Statistical Inference for Stochastic Processes, 5:273–286, 2002.
T. Ozaki. A bridge between nonlinear time series models and nonlinear stochastic dynamical
systems: a local linearization approach. Statistica Sinica, 2:113–135, 1992.
T. Paus. Imaging the brain before, during, and after transcranial magnetic stimulation.
Neuropsychologia, 37(2):219–24, 1999.
Peter C.B. Phillips. The problem of identification in finite parameter continuous time
models. Journal of Econometrics, 1:351–362, 1973.
Peter C.B. Phillips. The estimation of some continuous time models. Econometrica,
42:803–823, 1974.
K. P. Pruessmann. Parallel imaging at high field strength: synergies and joint potential.
Top Magn Reson Imaging, 15(4):237–44, 2004.
A. Roebroeck, E. Formisano, and R. Goebel. Mapping directed influence over the brain
using Granger causality and fMRI. Neuroimage, 25(1):230–42, 2005.
A. Roebroeck, E. Formisano, and R. Goebel. The identification of interacting networks
in the brain using fMRI: model selection, causality and deconvolution. Neuroimage,
2009a.
A. Roebroeck, E. Formisano, and R. Goebel. Reply to Friston and David after comments
on: The identification of interacting networks in the brain using fMRI: model
selection, causality and deconvolution. Neuroimage, 2009b.
S. Ryali, K. Supekar, T. Chen, and V. Menon. Multivariate dynamical systems models
for estimating causal interactions in fMRI. Neuroimage, 2010.
Z. S. Saad, K. M. Ropella, R. W. Cox, and E. A. DeYoe. Analysis and use of fMRI
response delays. Hum Brain Mapp, 13(2):74–93, 2001.
R. Salmelin and J. Kujala. Neural representation of language: activation versus long-
range connectivity. Trends Cogn Sci, 10(11):519–25, 2006.
J. R. Sato, D. Y. Takahashi, S. M. Arcuri, K. Sameshima, P. A. Morettin, and L. A.
Baccala. Frequency domain connectivity identification: an application of partial
directed coherence in fMRI. Hum Brain Mapp, 30(2):452–61, 2009.
J. R. Sato, A. Fujita, E. F. Cardoso, C. E. Thomaz, M. J. Brammer, and E. Amaro Jr.
Analyzing the connectivity between regions of interest: an approach based on cluster
Granger causality for fMRI data analysis. Neuroimage, 52(4):1444–55, 2010.
M. B. Schippers, A. Roebroeck, R. Renken, L. Nanetti, and C. Keysers. Mapping the
information flow from one brain to another during gestural communication. Proc
Natl Acad Sci U S A, 107(20):9388–93, 2010.
T. Schweder. Composable Markov processes. Journal of Applied Probability, 7(2):
400–410, 1970.
J. F. Smith, A. Pillai, K. Chen, and B. Horwitz. Identification and validation of effec-
tive connectivity networks in functional magnetic resonance imaging using switching
linear dynamic systems. Neuroimage, 52(3):1027–40, 2009.
S. M. Smith, K. L. Miller, G. Salimi-Khorshidi, M. Webster, C. F. Beckmann, T. E.
Nichols, J. D. Ramsey, and M. W. Woolrich. Network modelling methods for fMRI.
Neuroimage, 2010.
V. Solo. On causality I: Sampling and noise. Proceedings of the 46th IEEE Conference
on Decision and Control, pages 3634–3639, 2006.
D. Sridharan, D. J. Levitin, and V. Menon. A critical role for the right fronto-insular
cortex in switching between central-executive and default-mode networks. Proc Natl
Acad Sci U S A, 105(34):12569–74, 2008.
Halbert White and Xun Lu. Granger causality and dynamic structural systems. Journal
of Financial Econometrics, 8(2):193–243, 2010.
JMLR: Workshop and Conference Proceedings 12:30–64, 2011    Causality in Time Series

Robust Statistics for Causality

Florin Popescu

Abstract
A widely agreed upon definition of time series causality inference, established in the
seminal article of Clive Granger (1969), is based on the relative ability of the history
of one time series to predict the current state of another, conditional on all other
past information. While the Granger Causality (GC) principle remains uncontested,
its literal application is challenged by practical and physical limitations of the process
of discretely sampling continuous dynamic systems. Advances in methodology for
time-series causality subsequently evolved mainly in econometrics and brain imag-
ing: while each domain has specific data and noise characteristics the basic aims and
challenges are similar. Dynamic interactions may occur at higher temporal or spatial
resolution than our ability to measure them, which leads to the potentially false infer-
ence of causation where only correlation is present. Causality assignment can be seen
as the principled partition of spectral coherence among interacting signals using both
auto-regressive (AR) modelling and spectral decomposition. While both approaches
are theoretically equivalent, interchangeably describing linear dynamic processes, the
purely spectral approach currently differs in its somewhat higher ability to accurately
deal with mixed additive noise.
Two new methods are introduced: 1) a purely auto-regressive method named Causal
Structural Information, which unlike current AR-based methods is robust to mixed
additive noise, and 2) a novel means of calculating multivariate spectra for unevenly
sampled data based on cardinal trigonometric functions, which is incorporated into
the recently introduced phase slope index (PSI) spectral causal inference method
(Nolte et al., 2008). In addition to these, PSI, partial coherence-based PSI and existing
AR-based causality measures were tested on a specially constructed data-set simulat-
ing possible confounding effects of mixed noise and another additionally testing the
influence of common, background driving signals. Tabulated statistics are provided,
in which true causality influence is subjected to an acceptable level of false inference
probability. The new methods as well as PSI are shown to allow reliable inference
for signals as short as 100 points, to be robust to additive colored mixed noise and
to the influence of common coupled driving signals, and to provide a useful measure
of the strength of causal influence.
Keywords: Causality, spectral decomposition, cross-correlation, auto-regressive models.
© 2011 F. Popescu.
Popescu
1. Introduction
Causality is the sine qua non of scientific inference methodology, allowing us, among
other things, to advocate effective policy, diagnose and cure disease, and explain brain
function. While it has recently attracted much interest within Machine Learning, it
bears reminding that a lot of this recent effort has been directed toward static data rather
than time series. The ‘classical’ statisticians of the early 20th century, such as Fisher,
Gosset and Karl Pearson, aimed at a rational and general recipe for causal inference and
discovery (Gigerenzer et al., 1990) but the tools they developed applied to simple types
of inference which required the pre-selection, through consensus or by design, of a
handful of candidate causes (or ‘treatments’) and a handful of subsequently occurring
candidate effects. Numerical experiments yielded tables which were intended to serve
as a technician’s almanac (Pearson, 1930; Fisher, 1925), and are today an essential part
of the vocabulary of scientific discourse, although tables have been replaced by precise
formulae and specialized software. These methods rely on removing possible causal
links at a certain ‘significance level’, on the basic premise that a twin experiment on
data of similar size generated by a hypothetical non-causal mechanism would yield a
result of similar strength only with a known (small) probability. While it may have
been hoped that a generalization of the statistical test of difference among population
means (e.g. the t-test) to the case of time series causal structure may be possible using
a similar almanac or recipe book approach, in reality causality has proven to be a much
more contentious - and difficult - issue.
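The "twin experiment" premise behind classical significance levels can be stated operationally as a permutation test: regenerate the statistic many times under an explicitly non-causal (exchangeable) mechanism and ask how often it matches the observed strength. A minimal sketch of this simple, pre-time-series type of inference, with invented data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Observed 'experiment': treated vs. control groups (synthetic data).
a = rng.normal(0.5, 1.0, 50)   # treated
b = rng.normal(0.0, 1.0, 50)   # control
observed = a.mean() - b.mean()

# Twin experiments from a hypothetical non-causal mechanism: relabel at random.
pooled = np.concatenate([a, b])
null = np.empty(10000)
for i in range(null.size):
    rng.shuffle(pooled)
    null[i] = pooled[:50].mean() - pooled[50:].mean()

# How often does the non-causal twin match the observed strength?
p = float(np.mean(np.abs(null) >= abs(observed)))
print(round(p, 4))
```

The point of the passage above is precisely that this recipe does not generalize straightforwardly to time series causal structure.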
Time series theory and analysis immediately followed the development of classical
statistics (Yule, 1926; Wold, 1938) and was spurred thereafter by exigence (a severe
economic boom/bust cycle, an intense high-tech global conflict) as well as opportu-
nity (the post-war advent of a machine able to perform large linear algebra calcula-
tions). From a wide historical perspective, Fisher’s ‘almanac’ has rendered the indus-
trial age more orderly and understandable. It can be argued, however, that the ‘scientific
method’, at least in its accounting/statistical aspects, has not kept up with the explosive
growth of data tabulated in history, geology, neuroscience, medicine, population dy-
namics, economics, finance and other fields in which causal structure is at best partially
known and understood, but is needed in order to cure or to advocate policy. While it
may have been hoped that the advent of the computer might give rise to an automatic
inference machine able to ‘sort out’ the ever-expanding data sphere, the potential of a
computer of any conceivable power to condense the world to predictable patterns has
long been proven to be shockingly limited by mathematicians such as Turing (Turing,
1936) and Kolmogorov (Kolmogorov and Shiryayev, 1992) - even before the ENIAC
was built. The basic problem reduces itself to the curse of dimensionality: being forced
to choose among combinations of members of a large set of hypotheses (Lanterman,
2001). Scientists as a whole took a more positive outlook, in line with post-war boom
optimism, and focused on accessible automatic inference problems. One of these
scientists was Norbert Wiener, who, besides founding the field of cybernetics (the
precursor of ML), introduced some of the basic tools of modern time-series analysis, a line
of research he began during wartime and focused on feedback control in ballistics. The
Robust Statistics for Causality
general terms, if we are presented a scatter plot X1 vs. X2 which looks like a noisy
sine wave, we may reasonably infer that X2 causes X1, since a given value of X2 ‘determines’
X1 and not vice versa. We may even make some mild assumptions about the
noise process which superimposes on a functional relation (X1 = f(X2) + additive noise
which is independent of X2) and by this means turn our intuition into a proper asymmetric
statistic, i.e. a controlled probability that X1 does not determine X2, an approach
that has proven remarkably successful in some cases where the presence of a causal
relation was known but the direction was not (Hoyer et al., 2009). The challenge here
is that, unlike in traditional statistics, there is not simply the case of the null hypothesis
and its converse, but one of four mutually exclusive cases: A) X1 is independent of
X2; B) X1 causes X2; C) X2 causes X1; and D) X1 and X2 are observations of dependent
and non-causally related random variables (bidirectional information flow or feedback).
The appearance of a symmetric bijection (with additive noise) between X1 and X2 does
not mean absence of causal relation, as asymmetry in the apparent relations is merely a
clue and not a determinant of causality. Inference over static data is not without ambi-
guities without additional assumptions and requires observations of interacting triples
(or more) of variables as to allow somewhat reliable descriptions of causal relations or
lack thereof (see Guyon et al. (2010) for a more comprehensive overview). Statisti-
cal evaluation requires estimation of relative likelihood of various candidate models or
causal structures, including a null hypothesis of non-causality. In the case of complex
multidimensional data theoretical derivation of such probabilities is quite difficult, since
it is hard to analytically describe the class of dynamic systems we may be expected to
encounter. Instead, common ML practice consists in running toy experiments in which
the ‘ground truth’ (in our case, causal structure) is only known to those who run the
experiment, while other scientists aim to test their discovery algorithms on such data,
and methodological validity (including error rate) of any candidate method rests on its
ability to predict responses to a set of ‘stimuli’ (test data samples) available only to
the scientists organizing the challenge. This is the underlying paradigm of the Causal-
ity Workbench (Guyon, 2011). In time series causality, we fortunately have far more
information at our disposal relevant to causality than in the static case. Any type of
reasonable interpretation of causality implies a physical mechanism which accepts a
modifiable input and performs some operations in some finite time which then produce
an output and includes a source of randomness which gives it a stochastic nature, be it
inherent to the mechanism itself or in the observation process. Intuitively, the structure
or connectivity among input-output blocks that govern a data generating process are re-
lated to causality no matter (within limits) what the exact input-output relationships are:
this is what we mean by structural causality. However, not all structures of data gen-
erating processes are obviously causal, nor is it self evident how structure corresponds
to Granger (non) causality (GC), as shown in further detail by White and Lu (2010).
Granger causality is a measure of relative predictive information among variables and
not evidence of a direct physical mechanism linking the two processes: no amount of
analysis can exclude a latent unobserved cause. Strictly speaking the GC statistic is
2. Causality statistic
Causality inference is subject to a wider class of errors than classical statistics, which
tests independence among variables. A general hypothesis evaluation framework can
be:
Type I error prob.    α = P( Ĥa or Ĥb | H0 )                    (1)

Type II error prob.   β = P( Ĥ0 | Ha or Hb )

Type III error prob.  γ = P( Ĥa | Hb or Ĥb | Ha )
The notation Ĥ means that our statistical estimate of the likelihood of H
exceeds the threshold needed for our decision to confirm it. This formulation carries
some caveats the justification for which is pragmatic and will be expounded upon in
later sections. The main one is the use of the term ‘drives’ in place of ‘causes’. The
null hypothesis can be viewed as equivalent to strong Granger non-causality (as it will
be argued is necessary), but it does not mean that the signals A and B are indepen-
dent: they may well be correlated to one another. Furthermore, we cannot realistically
aim at statistically supporting strict Granger causality, i.e. strictly one-sided causal
interaction, since asymmetry in bidirectional interaction may be more likely in real-
world observations and is equally meaningful. By ‘driving’ we mean instead that the
history of one time series element A is more useful to predicting the current state of
B than vice-versa, and not that the history of B is irrelevant to predicting A. In the
latter case we would specify ‘G-causes’ instead of ‘drives’ and for H0 we would employ
non-parametric independence tests of Granger non-causality (GNC), which have
already been developed as in Su and White (2008) and Moneta et al. (2010). Note that
the definition in (1) is different from that recently proposed in White and Lu (2010),
which goes further than GNC testing to make the point that structural causality infer-
ence must also involve a further conditional independence test: Conditional Exogeneity
(CE). In simple terms, CE tests whether the innovations process of the potential effect
is conditionally independent of the cause (or, by practical consequence, whether the
innovations processes are uncorrelated). White and Lu argue that if both GNC and CE
fail we ought not make any decision regarding causality, and combine the power of
both tests in a principled manner such that the probability of false causal inference, or
non-decision, is controlled. The difference in this study is that the concurrent failure
of GNC and CE is precisely the difficult situation requiring additional focus and it will
be argued that methods that can cope with this situation can also perform well for the
case of CE, although they require stronger assumptions. In effect, it is assumed that
real-world signals feature a high degree of non-causal correlation, due to aliasing ef-
fects as described in the following section, and that strong evidence to the contrary is
required, i.e. that non-decision is equivalent to inference of non-causality. The precise
meaning of ’driving’ will also be made explicit in the description of Causal Structural
Information, which is implicitly a proposed definition of H0 . Also different in Defini-
tion (1) than in White and Lu is the accounting of potential error in causal direction
assignment under a framework which forces the practitioner to make such a choice if
GNC is rejected.
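Definition (1) can be exercised in simulation. The sketch below is deliberately crude — the AR-based statistic, coefficients and threshold are all arbitrary stand-ins, not the methods proposed later in this paper. It draws bivariate data under known ground truth (H0, or Ha: A drives B), applies a forced three-way decision rule, and tallies Monte Carlo estimates of the Type I and Type III error rates:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(truth, n=400, c=0.4):
    """Bivariate AR(1) under known ground truth ('H0', 'Ha': A drives B, 'Hb')."""
    a = np.zeros(n); b = np.zeros(n)
    for t in range(1, n):
        a[t] = 0.5*a[t-1] + (c*b[t-1] if truth == 'Hb' else 0.0) + rng.standard_normal()
        b[t] = 0.5*b[t-1] + (c*a[t-1] if truth == 'Ha' else 0.0) + rng.standard_normal()
    return a, b

def gc(src, dst, p=3):
    """Crude OLS Granger statistic: log restricted/full residual variance."""
    T = len(dst)
    Xr = np.array([dst[t-p:t] for t in range(p, T)])
    Xf = np.hstack([Xr, np.array([src[t-p:t] for t in range(p, T)])])
    y = dst[p:]
    rr = y - Xr @ np.linalg.lstsq(Xr, y, rcond=None)[0]
    rf = y - Xf @ np.linalg.lstsq(Xf, y, rcond=None)[0]
    return np.log(rr.var() / rf.var())

def decide(a, b, thresh=0.02):
    """Three-way rule: H0 unless one direction clears the (arbitrary) threshold."""
    fab, fba = gc(a, b), gc(b, a)
    if max(fab, fba) < thresh:
        return 'H0'
    return 'Ha' if fab > fba else 'Hb'

trials = 100
# Type I: deciding Ha or Hb when H0 is true; Type III: wrong direction under Ha.
alpha = np.mean([decide(*simulate('H0')) != 'H0' for _ in range(trials)])
gamma = np.mean([decide(*simulate('Ha')) == 'Hb' for _ in range(trials)])
print(alpha, gamma)
```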
One of the difficulties of causality inference methodology is that it is difficult to as-
certain what true causality in the real world (‘ground truth’) is for a sufficiently compre-
hensive class of problems (such that we can reliably gauge error probabilities): hence the
need for extensive simulation. A clear means of validating a causal hypothesis would
be intervention Pearl (2000), i.e. modification of the presumed cause, but in instances
such as historic and geological data this is not feasible. The basic approach will be to
assume a non-informative probability distribution of the degree of mixing, or
non-causal dynamic interactions, as well as over individual spectra and compile infer-
ence error probabilities over a wide class of coupled dynamic systems. In constructing a
‘robust causality’ statistic there is more than simply null-hypothesis rejection and accu-
rate directionality to consider, however. In scientific practice we are not only interested
to know that A and B are causally related or not, but which is the main driver in case of
bidirectional coupling, and among a time series vector A, B, C, D... it is important to
determine which of these factors are the main causes of the target variable, say A. The
relative effect size and relative causal influence strength must also be quantified, lest
the analysis be misused (Ziliak and McCloskey, 2008). The rhetorical and scientific value of effect size in no
way devalues the underlying principle of robust statistics and controlled inference error
probabilities used to quantify it.
y_i = Σ_{k=1}^{K} A_k y_{i−k} + Bu + w_i                    (2)
where {y_{i,d}, d = 1..D} is a real-valued vector of dimension D. Notice the absence of a
subscript in the exogenous input term u. This is because a general treatment of exogenous
inputs requires a lagged sum, i.e. Σ_{k=1}^{K} B_k u_{i−k}. Since exogenous inputs are not
explicitly addressed in the following derivations, the general linear operator placeholder
Bu is used instead and can be re-substituted for subsequent use.
Granger non-causality for this system, expressed in terms of conditional independence,
would place a relation among elements of y subject to knowledge of u. If D = 2, for all i
which would correspond to a vanishing partial correlation. If all A’s are lower triangular
G non-causality is satisfied (in one direction but not the converse). It is however very
rare that the physical mechanism we are observing is indeed the embodiment of a VAR,
and therefore even in the case in which G non-causality can be safely rejected, it is
not likely that the best VAR approximation of the data observed is strictly lower/upper
triangular. The necessity of a distinction between strict causality, which has a struc-
tural interpretation, and a causality statistic, which does not measure independence in
the sense of Granger non-causality, but rather relative degree of dependence in both
directions among two signals (driving), is most evident in this case. If the VAR in question
had very small (and statistically observable) upper triangular elements, would a
discussion of causality of the observed time series be rendered moot?
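The lower-triangular claim is easy to check numerically. This sketch (coefficients invented for illustration) simulates the VAR of Equation (2) with strictly lower-triangular A_k, so that y1 drives y2 but not conversely, and recovers the coefficients by least squares; the estimated entries multiplying y2's past in the equation for y1 stay statistically indistinguishable from zero:

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, D = 1000, 2, 2

# Strictly lower-triangular A_k: y1 feeds y2, never the converse.
A = [np.array([[0.5, 0.0], [0.3, 0.4]]),
     np.array([[0.2, 0.0], [0.1, 0.1]])]
B = np.array([0.0, 0.5])   # placeholder exogenous term Bu, with u held constant
u = 1.0
y = np.zeros((n, D))
for i in range(K, n):
    y[i] = sum(A[k] @ y[i-1-k] for k in range(K)) + B * u + rng.standard_normal(D)

# Least-squares recovery of the stacked VAR coefficients (plus a constant column).
X = np.hstack([y[K-1-k:n-1-k] for k in range(K)] + [np.ones((n-K, 1))])
coef, *_ = np.linalg.lstsq(X, y[K:], rcond=None)

# Column 0 predicts y1: the coefficients on y2's past remain near zero.
print(np.round(coef[:2 * D, 0], 2))
```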
One of the most common physical mechanisms which is incompatible with VAR
is aliasing, i.e. dynamics which are faster than the (shortest) sampling interval. The
standard interpretation of aliasing is the false representation of frequency components
of a signal due to sub-Nyquist frequency sampling: in the multivariate time-series case
this can also lead to spurious correlations in the observed innovations process (Phillips,
1973). Consider a continuous bivariate VAR of order 1 with Gaussian innovations in
which the sampling frequency is several orders of magnitude smaller than the Nyquist
frequency. In this case we would observe a covariate time independent Gaussian pro-
cess since for all practical purposes the information travels ‘instantaneously’. In eco-
nomics, this effect could be due to social interactions or market reactions to news which
happen faster than the sampling interval (be it daily, hourly or monthly). In fMRI analysis,
sub-sampling-interval brain dynamics are observed through the relatively slow
convolution process of the hemodynamic response to neural activity (for a detailed exposition
of causality inference in fMRI see Roebroeck et al. (2011) in this volume).
Although 'aliasing' normally refers to temporal aliasing, the same process can occur
spatially. In neuroscience and in economics the observed variables are summations
(dimensionality reductions) of a far larger set of interacting agents, be they individuals
or neurons. In electroencephalography (EEG) the propagation of electrical potential
from cortical axons arrives via multiple pathways at the same recording location on
the scalp: micrometer-scale electric potentials are summed at centimeter scale on the
scalp. Once again there are spurious observable correlations: this is known as the
mixing problem. Such effects can be modeled, albeit with significant information loss,
by a DGP class which is a superset of VAR, known in econometrics as SVAR (structural
vector auto-regression, the time-series equivalent of structural equation modeling (SEM),
often used in static causality inference (Pearl, 2000)). Another basic problem in dynamic
system identification is that we not only discard much information about the world in
sampling it, but our observations are also susceptible to additive noise, so that the
randomness we see in the data is not entirely the randomness of the mechanism we
intend to study. One of the most problematic additive noise models is mixed colored
noise, in which there are structured correlations both in time and across elements of
the time series, but not in any causal way: there is only a linear transformation of
colored noise, sometimes called mixing, due to spatial aliasing.
Robust Statistics for Causality
Popescu
Definition 2 The cover of a Data Generating Process (DGP) class is the cover of the
set of all outputs y that a DGP calculates for each member of the set of admissible
parameters a, u, t, w and for each initial condition $y_1$. Two DGPs are stochastically
equivalent if the covers of their sets of possible outputs (for fixed parameters) are
the same.
This differs from Equation (3) in two elemental ways: it is not a statement of
independence but a number (statistic), namely the average difference (rate) of conditional
(or prefix) Kolmogorov complexity of each point in the presumed effect vector when
given both vector histories or just one, and given the exogenous input history. It is a
generalized conditional entropy rate, and may reasonably be normalized as such:
$$F^{K}_{2\to 1|u} = \frac{1}{i}\sum_{j=1}^{i}\left[1 - \frac{K(y_{1,j}\mid y_{2,j-1..1},\, y_{1,j-1..1},\, u_{j-1..1})}{K(y_{1,j}\mid y_{1,j-1..1},\, u_{j-1..1})}\right] \tag{5}$$
$$y_i = \sum_{k=1}^{K} A_k\, y_{i-k} + D\, w_{i-1}, \qquad D_{ii} > 0,\; D_{ij} = 0\; (i \neq j) \tag{6}$$
k=1
The matrix D is a positive diagonal matrix containing the scaling, or effective stan-
dard deviations of the innovation terms. The standard deviation of each element of the
innovations term w is assumed hereafter to be equal to 1.
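As a minimal concrete illustration, the strictly causal form of Equation (6) with K = 1 can be simulated in a few lines. This is a sketch with arbitrary, hypothetical coefficients (the upper triangular lag matrix makes the second series drive the first):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stable bivariate VAR(1): upper triangular lag matrix,
# so series 2 drives series 1 but not vice versa.
A1 = np.array([[0.5, 0.3],
               [0.0, 0.5]])
D = np.diag([1.0, 0.7])            # positive diagonal scaling, as in Eq. (6)

N = 2000
y = np.zeros((N, 2))
w = rng.standard_normal((N, 2))    # unit-variance innovations
for i in range(1, N):
    y[i] = A1 @ y[i - 1] + D @ w[i]

print(y.std(axis=0))               # both series have finite, stationary variance
```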
$$y_i = \sum_{k=0}^{K} A_k\, y_{i-k} + B u + D w_i \tag{7}$$
The difference between this and Equation (6) is the presence of a 0-lag matrix $A_0$
which, for tractability, has zero diagonal entries and is sometimes placed on the LHS.
This 0-lag matrix is meant to model the sub-sampling-interval dynamic interactions
among observations, which appear instantaneous; see Moneta et al. (2011) in this
volume. Let us call this form zero-lag SVAR. In electro- and magneto-encephalography
(EEG/MEG) we often encounter the following form:
$$x_i = \sum_{k=1}^{K} {}^{\mu}\!A_k\, x_{i-k} + {}^{\mu}\!B u + D w_i, \qquad y_i = C x_i \tag{8}$$
where C represents the observation, or mixing, matrix. It is determined by the
conductivity/permeability of tissue and accounts for the superposition of the
electromagnetic fields created by neural activity, which happens at nearly the speed of
light and therefore appears instantaneous. Let us call this mixed output SVAR. Finally,
in certain engineering applications we may see structured disturbances:
$$y_i = \sum_{k=1}^{K} {}^{\theta}\!A_k\, y_{i-k} + {}^{\theta}\!B u + D_w w_i \tag{9}$$
which we shall call covariate innovations SVAR ($D_w$ is a general nonsingular matrix,
unlike $D$, which is diagonal). A final SVAR form to consider is one in which the
0-lag matrix $\tilde{A}_0$ is strictly upper triangular (upper triangular zero-lag
SVAR):
$$y_i = \tilde{A}_0\, y_i + \sum_{k=1}^{K} A_k\, y_{i-k} + \tilde{B} u + D w_i \tag{10}$$
$$y_i = \sum_{k=0}^{K} A_k\, y_{i-k} + B u + \tilde{D} w_i \tag{11}$$
where $\tilde{D}$ is upper/lower triangular. The SVAR forms (7)-(10) may look different,
and in fact each of them may uniquely represent physical processes and allow for direct
interpretation of parameters. From a statistical point of view, however, all four SVAR
DGPs introduced above are equivalent since they have identical cover.
Lemma 3 The Gaussian covariate innovations SVAR DGP has the same cover as the
Gaussian mixed output SVAR DGP. Each of these sets has a redundancy of $2^N N!$ for
instances in which the matrix $D_w$ is the product of a unitary and a diagonal matrix,
the matrix $C$ is a unitary matrix, and the matrix $A_0$ is a permutation of an upper
triangular matrix.
Proof Starting with the definition of covariate innovations SVAR in Equation (9), we
use the variable transformation $y = D_w x$ and obtain the mixed-output form (trivial).
The set of Gaussian random variables is closed under scalar multiplication (and hence
sign change) and addition. This means that the innovations term in Equation (9) can be
rescaled by any unitary matrix U, and the DGP written as:
$$y_i = \sum_{k=1}^{K} {}^{\theta}\!A_k\, y_{i-k} + {}^{\theta}\!B u + U D_w w_i$$
$$x_i = \sum_{k=1}^{K} U^T({}^{\theta}\!A_k)U\, x_{i-k} + U^T({}^{\theta}\!B)\, u + D_w w_i$$
$$y_i = U x_i$$
Since the transformation U is one-to-one and invertible, and since this transformation
is what allows a (restricted) covariate noise SVAR to map, one to one, onto a
mixed output SVAR, the cardinality of both covers is the same.
Now consider the zero-lag SVAR form:
$$y_i = \sum_{k=0}^{K} A_k\, y_{i-k} + B u + D w_i$$
$$D^{-1}(I - A_0)\, y_i = D^{-1}\sum_{k=1}^{K} A_k\, y_{i-k} + D^{-1} B u + w_i$$
Taking the singular value decomposition of the (nonsingular) matrix coefficient on the
LHS:
$$U_0 S V_0^T\, y_i = D^{-1}\sum_{k=1}^{K} A_k\, y_{i-k} + D^{-1} B u + w_i$$
$$V_0^T y_i = S^{-1} U_0^T D^{-1}\sum_{k=1}^{K} A_k\, y_{i-k} + S^{-1} U_0^T D^{-1} B u + S^{-1} U_0^T w_i$$
Using the coordinate transformation $z = V_0^T y$, the unitary transformation $U_0^T$
can be ignored due to the closure properties of the Gaussian. This leaves us with the
mixed-output form:
$$z_i = S^{-1} U_0^T D^{-1}\sum_{k=1}^{K} A_k V_0\, z_{i-k} + S^{-1} U_0^T D^{-1} B u + S^{-1} w_i, \qquad y = V_0 z$$
So far we have shown that for every zero-lag SVAR there is at least one mixed-output
SVAR. Let us for a moment consider the covariate noise SVAR (after pre-multiplication):
$$D_w^{-1} y_i = D_w^{-1}\sum_{k=1}^{K} {}^{\theta}\!A_k\, y_{i-k} + D_w^{-1}\,{}^{\theta}\!B u + w_i$$
$$y_i = \left(I - D_w^{-1}\right) y_i + D_w^{-1}\sum_{k=1}^{K} {}^{\theta}\!A_k\, y_{i-k} + D_w^{-1}\,{}^{\theta}\!B u + w_i$$
$$\operatorname{diag}(D_w^{-1})\, y_i = \left(\operatorname{diag}(D_w^{-1}) - D_w^{-1}\right) y_i + D_w^{-1}\sum_{k=1}^{K} {}^{\theta}\!A_k\, y_{i-k} + D_w^{-1}\,{}^{\theta}\!B u + w_i$$
$$D_0 \triangleq \operatorname{diag}(D_w^{-1})$$
$$y_i = \left(I - D_0^{-1} D_w^{-1}\right) y_i + D_0^{-1} D_w^{-1}\sum_{k=1}^{K} {}^{\theta}\!A_k\, y_{i-k} + D_0^{-1} D_w^{-1}\,{}^{\theta}\!B u + D_0^{-1} w_i$$
$$A_0 = I - D_0^{-1} D_w^{-1}$$
$$D_w^{-1} = \operatorname{diag}(D_w^{-1})\,(I - A_0)$$
$$(I - A_0)^T (I - A_0) = (I - D_0^{-1} D_w^{-1})^T (I - D_0^{-1} D_w^{-1})$$
The zero-lag matrix is a function of $D_w^{-1}$, the inversion of which is an eigenvalue
problem. However, as long as the covariance matrix or its inverse is constant, the DGP
is unchanged, and this allows $N(N-1)/2$ degrees of freedom. Let us consider only
mixed input systems for which the innovations terms are of unit variance. There is no
real loss of generality, since a simple row division by each element of $D_0$ normalizes
the covariate noise form (to be regained by scaling the output). In this case the
equivalence constraint becomes:
$$(I - \tilde{A}_0)^T (I - \tilde{A}_0) = (I - A_0)^T (I - A_0)$$
If $(I - A_0)$ is full rank, a strictly upper triangular matrix $\tilde{A}_0$ may be
found that is equivalent (this would be the Cholesky decomposition of the inverse
covariance matrix in reverse order). As $D_w$ is equivalent to a unitary transformation
$U D_w$, this will include permutations and orthogonal rotations. Any permutation of
$D_w$ will imply a corresponding permutation of $A_0$, which (along with rotations) has
$2^N N!$ solutions.
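The existence claim above (a strictly upper triangular $\tilde{A}_0$ obtained via Cholesky factorization whenever $I - A_0$ is full rank) can be verified numerically; the following sketch uses a hypothetical 3x3 example, with the diagonal scale factors playing the role of the matrix $D_0$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3x3 zero-lag matrix with zero diagonal; I - A0 is full rank here.
A0 = 0.5 * rng.standard_normal((3, 3))
np.fill_diagonal(A0, 0.0)
M = (np.eye(3) - A0).T @ (np.eye(3) - A0)     # the equivalence-defining product

# Cholesky factorization M = U^T U with U upper triangular (positive diagonal).
U = np.linalg.cholesky(M).T

# Row-normalize to unit diagonal; the scales act as the diagonal innovation
# scaling, leaving a strictly upper triangular equivalent zero-lag matrix.
d = np.diag(U)
A0_tilde = np.eye(3) - U / d[:, None]

print(np.allclose(np.tril(A0_tilde), 0.0))    # strictly upper triangular
```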
Figure 1: SVAR causality and equivalence. A) Direct structural Granger causality
	(both directions shown); z stands for the delay operator. B) Equivalent
	covariate innovations (left) and mixed output (right) systems; neither
	representation shows dynamic interaction. C) A sparse, one-sided covariate
	innovations DAG is non-sparse in the mixed output case (and vice-versa).
	D) The upper triangular structure of the zero-lag matrix is not informative
	in the 2-variable Gaussian case, and is equivalent to a full mixed output
	system.
Granger, in his 1969 paper, suggests that 'instantaneous' (i.e. covariate) effects be
ignored and only the temporal structure be used. Whether or not we accept instantaneous
causality depends on prior knowledge: in the case of EEG, the mixing matrix
cannot have any physical 'causal' explanation even if it is sparse. Without additional
a priori assumptions, either we infer causality on unseen and presumably interacting
hidden variables (the mixed output form, the case of EEG/MEG) or we assume a non-causal
mixed innovations input. Note also that the zero-lag system appears to be causal
but can be written in a form which suggests the opposite direction of causal influence
(hence it is sometimes termed 'spurious causality'). In short, since instantaneous
interaction in the Gaussian case cannot be resolved causally purely in terms of prediction
and conditional information (as intended by Wiener and Granger), it is proposed that
such interactions be accounted for but not given causal interpretation (as in 'strong'
Granger non-causality).
There are at least four distinct overall approaches to dealing with aliasing effects
in time series causality.
1) Make prior assumptions about covariance matrices and limit inference to
domain-relevant and interpretable posteriors, as in Bernanke et al. (2005) in economics
and Valdes-Sosa et al. (2005) in neuroscience.
2) Allow for unconstrained graphical causal model type inference among covariate
innovations, assuming either Gaussianity or non-Gaussianity, the latter allowing for
stronger causal inferences (see Moneta et al. (2011) in this volume). One possible
drawback of this approach is that DAG-type inference is non-unique, at least in the
Gaussian case, in which there is so-called 'Markov equivalence' among candidate graphs.
3) Assume a physically interpretable mixed output or covariate innovations form and
take the inferred sparsity structure (or the intersection thereof over the nonzero-lag
coefficient matrices) as the connection graph. Popescu (2008) implemented such an
approach by using the minimum description length principle to provide a universal prior
over rational-valued coefficients, and was able to recover structure in the majority of
simulated covariate innovations processes of arbitrary sparsity. This approach is
computationally laborious, as it is NP-hard and non-convex, and moreover a system that
is sparse in one form (covariate innovations or mixed output) is not necessarily sparse
in another equivalent SVAR form. Moreover, completely dense SVAR systems may be
non-causal (in the strong GC sense).
4) Do not interpret causality as a binary value; instead, determine the direction of
interaction as a continuous-valued statistic, one which is theoretically robust to
covariate innovations or mixtures. This is the principle of the recently introduced
phase slope index (PSI), which belongs to a class of methods based on spectral
decomposition and partition of coherency. Although auto-regressive, spectral and
impulse-response convolution are theoretically equivalent representations of linear
dynamics, they do differ numerically, and spectral representations afford direct access
to phase estimates, which are crucial to the interpretation of lead and lag as they
relate to causal influence. These methods are reviewed in the next section.
$$C_{ij}(\omega) = \frac{S_{ij}(\omega)}{\sqrt{S_{ii}(\omega)\, S_{jj}(\omega)}} \tag{13}$$
which consists of a complex numerator and a real-valued denominator. The coherence is
the squared magnitude of the coherency.
Besides the various histogram and discrete (fast) Fourier transform methods available
for the computation of coherence, AR methods may also be used, since they are also
linear transforms, the Fourier transform of the $k$-sample delay operator being simply
$e^{-j2\pi\omega\tau_S k}$, where $\tau_S$ is the sampling time. Plugging this into
Equation (8) we obtain:
$$X(j\omega) = \sum_{k=1}^{K} A_k\, e^{-j2\pi\omega\tau_S k}\, X(j\omega) + B\, U(j\omega) + D\, W(j\omega)$$
$$Y(j\omega) = C\left(I - \sum_{k=1}^{K} A_k\, e^{-j2\pi\omega\tau_S k}\right)^{-1}\left(B\, U(j\omega) + D\, W(j\omega)\right) \tag{16}$$
In terms of an SVAR therefore (as opposed to a VAR), the mixing matrix C affects
neither stability nor the dynamic response (i.e. the poles). The transfer functions
from the ith innovations process to the jth output are entries of the following matrix
of functions:
$$H(j\omega) = C\left(I - \sum_{k=1}^{K} A_k\, e^{-j2\pi\omega\tau_S k}\right)^{-1} D \tag{17}$$
The spectral matrix is simply (having already assumed independent unit-variance
Gaussian noise):
$$S(j\omega) = H(j\omega)\, H^{*}(j\omega)$$
The partial spectrum, defined on row/column subsets of the matrix $S(j\omega)$ and
substituted into Equation (13), gives us the partial coherency $C_{i,j|(i,j)}(j\omega)$
and, correspondingly, the partial coherence $c_{i,j|(i,j)}(j\omega)$. These functions
are symmetric and therefore cannot indicate the direction of interaction in the pair
$(i, j)$. Several alternatives have been proposed to account for this limitation.
Kaminski and Blinowska (1991) and Blinowska et al. (2004) proposed the following
normalization of $H(j\omega)$, which attempts to measure the relative magnitude of the
transfer function from any innovations process to any output (equivalent to measuring
the normalized strength of Granger causality) and is called the directed transfer
function (DTF):
$$\gamma_{ij}(j\omega) = \frac{H_{ij}(j\omega)}{\sqrt{\sum_k |H_{ik}(j\omega)|^2}}, \qquad
\gamma^2_{ij}(j\omega) = \frac{\left|H_{ij}(j\omega)\right|^2}{\sum_k |H_{ik}(j\omega)|^2} \tag{20}$$
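A direct way to compute the squared DTF of Equation (20) is to evaluate the transfer function of Equation (17) on a frequency grid. The sketch below assumes C = I and arbitrary example coefficients:

```python
import numpy as np

def dtf(A_list, D, freqs, tau_s=1.0):
    """Squared DTF (Eq. 20) from lag matrices A_1..A_K and innovation scaling D."""
    n = D.shape[0]
    gamma2 = np.zeros((len(freqs), n, n))
    for fi, f in enumerate(freqs):
        Asum = sum(Ak * np.exp(-2j * np.pi * f * tau_s * (k + 1))
                   for k, Ak in enumerate(A_list))
        H = np.linalg.inv(np.eye(n) - Asum) @ D   # transfer function (Eq. 17, C = I)
        P = np.abs(H) ** 2
        gamma2[fi] = P / P.sum(axis=1, keepdims=True)   # normalize each row
    return gamma2

# Example: series 2 drives series 1 only (upper triangular lag matrix).
A1 = np.array([[0.5, 0.3],
               [0.0, 0.5]])
g2 = dtf([A1], np.eye(2), freqs=np.linspace(0, 0.5, 64))
print(g2[:, 1, 0].max())   # ~0: series 1's innovations never reach series 2
```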
A similar measure is called directed coherence (Baccalá et al., 1991), later elaborated
into a method complementary to DTF, called partial directed coherence (PDC)
(Baccalá and Sameshima, 2001; Sameshima and Baccalá, 1999), based on the inverse of H:
$$\pi_{ij}(j\omega) = \frac{H^{-1}_{ij}(j\omega)}{\sqrt{\sum_k \left|H^{-1}_{ik}(j\omega)\right|^2}}$$
$$H^{GC}_{j\to i|u} = H(y_{i,t+1} \mid y_{:,t:t-K},\, u_{:,t:t-K}) - H(y_{i,t+1} \mid y_{(\setminus j),t:t-K},\, u_{:,t:t-K})$$
$$H^{GC}_{j\to i|u} = \log D_i - \log D_i^{(j)} \tag{21}$$
The Shannon entropy of a Gaussian random variable is the logarithm of its standard
deviation plus a constant. Notice that in this paper the definition of Granger
causality is slightly different from the literature, in that it relates to the
innovations process of a mixed output SVAR system of closest rotation and not to a
regular MVAR. The second term $D_i^{(j)}$ is formed by computing a reduced SVAR system
which omits the jth variable. Recently Barrett et al. (2010) have proposed an extension
of GC, based on prior work by Geweke (1982), from interaction among pairs of variables
to groups of variables, termed multivariate Granger causality (MVGC). The above
definition is straightforwardly extensible to the group case, where I and J are subsets
of 1..D, since the total entropy of independent variables is the sum of the individual
entropies:
$$H^{GC}_{J\to I|u} = \sum_{i\in I}\left[\log D_i - \log D_i^{(J)}\right] \tag{22}$$
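Under the Gaussian assumption, the Granger entropy of Equation (21) reduces to a log ratio of residual standard deviations from full and reduced least-squares fits. The following is a sketch using an ordinary VAR fit rather than the mixed output SVAR of closest rotation described above; the coefficients are arbitrary:

```python
import numpy as np

def var_residual_std(y, target, predictors, K=2):
    """Std of least-squares residuals predicting y[:, target] from K lags of predictors."""
    N = y.shape[0]
    X = np.hstack([y[K - k - 1:N - k - 1][:, predictors] for k in range(K)])
    t = y[K:, target]
    beta, *_ = np.linalg.lstsq(X, t, rcond=None)
    return (t - X @ beta).std()

rng = np.random.default_rng(2)
N = 4000
y = np.zeros((N, 2))
w = rng.standard_normal((N, 2))
for i in range(1, N):
    y[i, 0] = 0.5 * y[i - 1, 0] + 0.4 * y[i - 1, 1] + w[i, 0]
    y[i, 1] = 0.5 * y[i - 1, 1] + w[i, 1]

# H_GC(2 -> 1): full model vs. reduced model omitting series 2 from the predictors.
full = var_residual_std(y, target=0, predictors=[0, 1])
reduced = var_residual_std(y, target=0, predictors=[0])
print(np.log(full) - np.log(reduced))   # negative under this sign convention
```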
The Granger entropy can be calculated directly from the transfer function, using the
Shannon-Hartley theorem:
$$H^{GCH}_{j\to i} = -\sum_{\omega} \Delta\omega\, \ln\left(1 - \frac{\left|H_{ij}(\omega)\right|^2}{S_{ii}(\omega)}\right) \tag{23}$$
Finally, Nolte et al. (2008) introduced a method called the Phase Slope Index (PSI),
which evaluates bilateral causal interaction and is robust to mixing effects (i.e.
zero-lag, observation or innovations covariance matrices that depart from MVAR):
$$PSI_{ij} = \sum_{\omega} \operatorname{Im}\left(C^{*}_{ij}(\omega)\, C_{ij}(\omega + d\omega)\right) \tag{24}$$
PSI, as a method, is based on the observation that pure mixing (that is to say, all
effects stochastically equivalent to output mixing as outlined above) does not affect
the imaginary part of the coherency $C_{ij}$, just as (equivalently) it does not affect
the antisymmetric part of the auto-correlation of a signal. It does not measure the
phase relationship per se, but rather the slope of the coherency phase weighted by the
magnitude of the coherency.
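A minimal PSI estimate per Equation (24) can be assembled from Welch cross-spectral estimators. This is a sketch using scipy; note that the sign convention of the cross-spectrum estimator, and hence of the index, depends on the library used:

```python
import numpy as np
from scipy.signal import csd, welch

def psi(x1, x2, nperseg=256):
    """Phase Slope Index (Eq. 24), summed over the whole band, via Welch estimators."""
    _, s12 = csd(x1, x2, nperseg=nperseg)
    _, s11 = welch(x1, nperseg=nperseg)
    _, s22 = welch(x2, nperseg=nperseg)
    c = s12 / np.sqrt(s11 * s22)                  # complex coherency, Eq. (13)
    return np.sum(np.imag(np.conj(c[:-1]) * c[1:]))

rng = np.random.default_rng(3)
n = 20000
s1, s2 = rng.standard_normal(n), rng.standard_normal(n)

# Lagged pair: x1 leads x2 by 5 samples (plus independent sensor noise).
x1 = s1 + 0.1 * rng.standard_normal(n)
x2 = np.roll(s1, 5) + 0.1 * rng.standard_normal(n)

# Instantaneously mixed pair: zero-lag mixing only, no lead/lag structure.
m1 = s1 + 0.5 * s2
m2 = 0.5 * s1 + s2

# PSI is antisymmetric in its arguments and near zero for pure mixing.
print(abs(psi(m1, m2)) < abs(psi(x1, x2)))
```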
$$ij^{*} \triangleq \{i, j, \overline{ij}\}, \qquad i^{*} \triangleq \{i, \overline{ij}\}$$
This means that we reorder the (identified) mixed-innovations system by placing the
target time series first and the driver series second, followed by all the rest
($\overline{ij}$ denoting the remaining indices). The same ordering, minus the driver,
is also useful. We define CSI as
$$F^{CSI}_{j\to i|\overline{ij}} \triangleq \frac{CSI(i, j \mid \overline{ij})}{\log\left({}^{i*}\!D_{11}\right) + \log\left({}^{j*}\!D_{11}\right) + \zeta} \tag{27}$$
This normalization effectively measures the ratio of causal to non-causal information,
where ζ is a constant which depends on the number of dimensions and the quantization
width, and which is necessary to transform continuous entropy into discrete entropy.
$$\rho_{ij}(k\tau_0) \triangleq \frac{1}{N-k}\sum_{q=1}^{N-k} x_i(q)\, x_j(q+k) \tag{31}$$
$$\left\{a_{ij}, b_{ij}\right\} = \arg\min \sum_{k=-N/2}^{N/2}\left(\rho_{ij}(k\tau_0) - \sum_n a_{ij,n}\cos(2\pi\Omega_n\tau_0 k) - \sum_n b_{ij,n}\sin(2\pi\Omega_n\tau_0 k)\right)^2 \tag{32}$$
where $\tau_0$ is the sampling interval. Note that for an odd number of points the
regression above is actually a well-determined set of equations, corresponding to the
2-sided DFT. Note also that, by replacing the expectation with the geometric mean, the
above equation can also be written (with a slight change in weighting at individual
lags) as:
$$\left\{a_{ij}, b_{ij}\right\} = \arg\min \sum_{p,q\in 1..N}\left(x_{i,p}\, x_{j,q} - \sum_n a_{ij,n}\cos(2\pi\Omega_n(t_{i,p} - t_{j,q})) - \sum_n b_{ij,n}\sin(2\pi\Omega_n(t_{i,p} - t_{j,q}))\right)^2 \tag{33}$$
The above equation holds even for time series sampled at unequal (but overlapping)
times $(x_i, t_i)$ and $(x_j, t_j)$, as long as the frequency basis definition is
adjusted (for example $\tau_0 = 1$). It represents a discrete, finite approximation of
the continuous, infinite auto-regression function of an infinitely long random process.
It is a regression on the outer product of the vectors $x_i$ and $x_j$. Since
autocorrelations for finite-memory systems tend to fall off to zero with increasing lag
magnitude, a novel coherency estimate is proposed based on the cardinal sine and cosine
functions, which also decay, as a compact basis:
$$\hat{C}_{ij}(\omega) = \sum_n a_{ij,n}\, \mathrm{C}(\Omega_n) + j\, b_{ij,n}\, \mathrm{S}(\Omega_n) \tag{34}$$
$$\left\{a_{ij}, b_{ij}\right\} = \arg\min \sum_{p,q\in 1..N}\left(x_{i,p}\, x_{j,q} - \sum_n a_{ij,n}\,\mathrm{cosc}(2\pi\Omega_n(t_{i,p} - t_{j,q})) - \sum_n b_{ij,n}\,\mathrm{sinc}(2\pi\Omega_n(t_{i,p} - t_{j,q}))\right)^2 \tag{35}$$
where the sine cardinal is defined as $\mathrm{sinc}(x) = \sin(\pi x)/x$, its Fourier
transform being $\mathrm{S}(j\omega) = 1$ for $|j\omega| < 1$ and $\mathrm{S}(j\omega) = 0$
otherwise; the Fourier transform of the cosine cardinal can be written as
$\mathrm{C}(j\omega) = j\omega\, \mathrm{S}(j\omega)$. Although in principle we could
choose any complete basis as a means of Fourier transform estimation, the cardinal
transform preserves the odd-even function structure of the standard trigonometric pair.
Computationally, this means that for autocorrelations, which are real-valued and even,
only sinc needs to be calculated and used, while for cross-correlations both functions
are needed. As linear mixtures of independent signals only have symmetric
cross-correlations, any nonzero values of the cosc coefficients indicate the presence
of dynamic interaction. Note that the Fast Fourier Transform earns its moniker thanks
to the orthogonality of sin and cos, which allows us to avoid a matrix inversion.
However, their orthogonality holds true only for infinite support, and slight
correlations are found for finite windows; in practice this effect requires further
computation (windowing) to counteract. The cardinal basis is not orthogonal, requires a
full regression, and may have demanding memory requirements. For moderately sized data
this is not problematic; implementation details will be discussed elsewhere.
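The claim that linear instantaneous mixtures of independent signals have symmetric cross-correlations, while genuine lagged interaction produces asymmetry, is easy to check numerically. A sketch (signals and coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 50000
s1, s2 = rng.standard_normal(N), rng.standard_normal(N)
# Color the sources slightly so lagged correlations exist at all.
s1 = np.convolve(s1, np.ones(4) / 4, mode="same")
s2 = np.convolve(s2, np.ones(4) / 4, mode="same")

def xcorr(a, b, k):
    """rho_ab(k) = E[a(q) b(q+k)], as in Eq. (31)."""
    return np.mean(a[:N - k] * b[k:]) if k >= 0 else np.mean(a[-k:] * b[:N + k])

mix1, mix2 = s1 + 0.5 * s2, 0.5 * s1 + s2   # instantaneous mixture of both sources
lag1, lag2 = s1, np.roll(s1, 2)             # genuine 2-sample lagged interaction

def asym(a, b):
    return max(abs(xcorr(a, b, k) - xcorr(a, b, -k)) for k in range(1, 6))

print(asym(mix1, mix2) < asym(lag1, lag2))  # mixtures are (nearly) symmetric
```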
$$x_{N,i} = \sum_{k=1}^{K}\begin{bmatrix} a_{11} & 0 \\ 0 & a_{22} \end{bmatrix}_{N,k} x_{N,i-k} + w_{N,i}$$
$$y = (1 - |\beta|)\, y_N + |\beta|\, \frac{\|y_N\|_F}{\|y_C\|_F}\, y_C \tag{38}$$
The two sub-systems are pictured graphically as systems A and B in Figure 1.
If β < 0 the AR matrices that create $y_C$ are transposed (meaning that $y_{1C}$ causes
$y_{2C}$ instead of the opposite). The coefficient β is represented in Nolte et al.
(2010) by γ, where β = sgn(γ)(1 − |γ|). All coefficients are generated as independent
Gaussian random variables of unit variance, and unstable systems are discarded. While
both the causal and the noise-generating systems have the same order, note that
generating their sum requires an infinite-order SVAR DGP (it is not stochastically
equivalent to any SVAR DGP, but is instead a SVARMA DGP, having both poles and zeros).
Nevertheless it is an interesting benchmark, since the exact parameters are not fully
recoverable via the commonly used VAR modeling procedure, and because the causality
interpretation is fairly clear: the sum of a strictly causal DGP and a stochastic
non-causal DGP should retain the causality of the former.
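The benchmark construction of Equation (38) amounts to a norm-balanced sum of a strictly causal system and a mixed, non-causal noise system. A sketch with hypothetical fixed coefficients (the actual benchmark draws random coefficients and discards unstable systems):

```python
import numpy as np

rng = np.random.default_rng(4)
N, beta = 1000, 0.4

# Causal system: series 1 drives series 2 (stable, hand-picked coefficients).
Ac = np.array([[0.5, 0.0],
               [0.4, 0.5]])
# Noise system: two independent AR(1) processes, instantaneously mixed afterwards.
An = np.diag([0.6, 0.6])
M = np.array([[1.0, 0.6],
              [0.6, 1.0]])                  # zero-lag mixing of the noise

yc = np.zeros((N, 2))
yn = np.zeros((N, 2))
for i in range(1, N):
    yc[i] = Ac @ yc[i - 1] + rng.standard_normal(2)
    yn[i] = An @ yn[i - 1] + rng.standard_normal(2)
yn = yn @ M.T                               # mixed colored noise, no causal structure

# Eq. (38): norm-balanced sum of the causal signal and the non-causal noise.
y = (1 - abs(beta)) * yn + abs(beta) * (np.linalg.norm(yn) / np.linalg.norm(yc)) * yc
print(y.shape)
```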
In this study, the same DGPs were used as in NOISE, but as one of the current aims
is to study the influence of sample size on the reliability of causality assignment,
signals of 100, 500, 1000 and 5000 points were generated (as opposed to the original
6000). This is the dataset referred to as PAIRS below, which differs only in the number
of samples per time series. For each evaluation, 500 time series were simulated, with
the order of each system uniformly distributed from 1 to 10. The following methods
were evaluated:
• Directed transfer function (DTF): estimation using an automatic model order selection
  criterion (BIC, Bayesian Information Criterion) with a maximum model order of 20. DTF
  has been shown to be equivalent to GC for linear AR models (Kaminski et al., 2001)
  and therefore GC itself is not shown. The covariance matrix of the residuals is also
  included in the estimate of the transfer function. The same holds for all methods
  described below.
All methods were statistically evaluated for robustness and generality by performing
a 5-fold jackknife, which gave both a mean and a standard deviation estimate for each
method and each simulation. All statistics reported below are means normalized by
standard deviation (from the jackknife). For all methods the final output could be -1,
0, or 1, corresponding to causality assignment 1 → 2, no assignment, and causality
2 → 1. A true positive (TP) was the rate of correct causality assignment, while a false
positive (FP) was the rate of incorrect causality assignment (type III error), such
that TP + FP + NA = 1, where NA stands for the rate of no assignment (or neutral
assignment). The TP and FP rates can be co-modulated by increasing/decreasing the
threshold on the absolute value of the mean/std statistic, under which no causality
assignment is made.
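The 5-fold jackknife normalization described above can be sketched as follows, with a toy lag-asymmetry statistic standing in for the causality measures (both the data and the statistic are hypothetical):

```python
import numpy as np

def jackknife_stat(x1, x2, stat, folds=5):
    """Mean/std-normalized statistic over leave-one-segment-out estimates."""
    segs = np.array_split(np.arange(len(x1)), folds)
    vals = []
    for f in range(folds):
        keep = np.concatenate([s for i, s in enumerate(segs) if i != f])
        vals.append(stat(x1[keep], x2[keep]))
    vals = np.asarray(vals)
    return vals.mean() / vals.std()

# Toy statistic: lag-1 cross-correlation asymmetry (positive when a leads b).
def toy_stat(a, b):
    return np.mean(a[:-1] * b[1:]) - np.mean(b[:-1] * a[1:])

rng = np.random.default_rng(5)
a = rng.standard_normal(5000)
b = 0.8 * np.concatenate([[0.0], a[:-1]]) + rng.standard_normal(5000)  # a leads b

print(jackknife_stat(a, b, toy_stat) > 0)   # consistent assignment a -> b
```

Thresholding the absolute value of this normalized statistic then yields the -1/0/1 assignments described in the text.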
In Table 2 we can see results for PAIRS, in which the noise mixing matrix B is not
strictly diagonal.
Figure 2: PSI vs. DTF. Scatter plots of β vs. STAT (to the left of each panel), TP vs.
	FP curves for different time series lengths (100, 500, 1000 and 5000) (right).
	a) colored unmixed additive noise. b) colored mixed additive noise. DTF
	is equivalent to Granger causality for linear systems. All STAT values are
	jackknife means normalized by standard deviation.
Figure 3: PSI vs. CSI. Scatter plots of β vs. STAT (to the left of each panel), TP vs.
	FP curves for different time series lengths (right). a) unmixed additive
	noise. b) mixed additive noise.
As we can see in both Figure 2 and Table 1, all methods are almost equally robust
to unmixed colored additive noise (except PDC). However, while the addition of mixed
colored noise induces only a mild gap in maximum accuracy, it creates a large gap in
terms of TP/FP rates. Note the dramatic drop-off of the TP rate of the VAR/SVAR-based
methods PDC and DTF. Figure 3 shows this most clearly, via a wide scatter of STAT
outputs for DTF around β = 0 (that is to say, with no actual causality in the time
series) and a corresponding fall-off of TP vs. FP rates. Note also that PSI methods
still allow a fairly reasonable TP rate at low FP rates of 10%, even at 100 points per
time series, while the CSI method was also robust to the addition of colored mixed
noise, not showing any significant difference with respect to PSI except a higher FP
rate for longer time series (N = 5000). PSIcardinal was near PSI in overall accuracy.
In conclusion, DTF (or weak Granger causality) and PDC are not robust with respect to
additive mixed colored noise, although they perform similarly to PSI and CSI for
independent colored noise.1
1. Correlation and rank correlation analysis was performed (for N = 5000) to shed light on the
reason for the discrepancy between PSI and CSI. The linear correlation between rawSTAT and STAT
was .87 and .89 for PSI and CSI, respectively. No influence of the model order K of the simulated
system was seen in the error of either PSI or CSI, where the error is estimated as the difference in
rank: rankerr(STAT) = |rank(β) − rank(STAT)|. There were, however, significant correlations between
rank(|β|) and rankerr(STAT): -.13 for PSI and -.27 for CSI. Note that, as expected, standard Granger
causality (GC) performed the same as DTF (TP = 0.116 for FP < .10). Using Akaike's Information
Criterion (AIC) instead of BIC for VAR model order estimation did not significantly affect AR-based
STAT values.
$$\beta > 0:\quad y_{C,i} = \sum_{k=1}^{K}\begin{bmatrix} a_{11} & 0 & 0\\ a_{12} & a_{22} & 0\\ 0 & 0 & a_{33}\end{bmatrix}_{C,k} y_{C,i-k} + w_{C,i} \tag{39}$$
$$x_{N,i} = \sum_{k=1}^{K}\begin{bmatrix} a_{11} & 0 & 0\\ 0 & a_{22} & 0\\ 0 & 0 & a_{22}\end{bmatrix}_{N,k} x_{N,i-k} + w_{N,i}$$
$$x_{D,i} = \sum_{k=1}^{K}\begin{bmatrix} a_{11} & 0 & a_{13}\\ 0 & a_{22} & a_{23}\\ 0 & 0 & a_{22}\end{bmatrix}_{D,k} x_{D,i-k} + w_{D,i}$$
$$y_{MC} = (1 - |\beta|)\, y_N + |\beta|\, \frac{\|y_N\|_F}{\|y_C\|_F}\, y_C \tag{41}$$
$$y_{DC} = (1 - \chi)\, y_{MC} + \chi\, \frac{\|y_{MC}\|_F}{\|y_D\|_F}\, y_D \tag{42}$$
Table 3, similar to the tables in the preceding section, shows results for all the
usual methods, plus PSIpartial, which is PSI calculated on the partial coherence as
defined above and computed from Welch (cross-spectral) estimators in the case of mixed
noise and a common driver. Notice that the TP rates are lower for all methods with
respect to Table 2, which represents the mixed noise situation without any common
driver.
10. Discussion
In a recent talk, Emanuel Parzen (Parzen, 2004) proposed, both in hindsight and for
future consideration, that the aim of statistics should be an 'answer machine', i.e. a
more intelligent, automatic and comprehensive version of Fisher's almanac, which
currently consists of a plenitude of chapters and sections related to different types
of hypotheses and assumption sets meant to model, insofar as possible, the
ever-expanding variety of data available. These categories and sub-categories are not
always distinct, and furthermore there are competing general approaches to the same
problems (e.g. Bayesian vs. frequentist). Is an 'answer machine' realistic in terms of
time-series causality, the prerequisites for which are found throughout this almanac,
and which has developed in parallel in different disciplines?
This work began by discussing Granger causality in abstract terms, pointing out the
implausibility of finding a general method of causal discovery, since that depends on
the general learning and time-series prediction problems, which are incomputable.
However, if any consistent pattern can be found mapping the history of one time-series
variable to the current state of another (using non-parametric tests), there is
sufficient evidence of causal interaction and the null hypothesis is rejected. Such a
determination still does not address the direction of interaction or the relative
strength of causal influence, which may require a complete model of the DGP. This study
(like many others) relied on the rather strong assumption of stationary linear Gaussian
DGPs, but otherwise made weak assumptions on model order, sampling and observation
noise. Are there, instead, more general assumptions we can use? The following is a list
of competing approaches in increasing order of (subjectively judged) strength of
underlying assumption(s):
Acknowledgments
I would like to thank Guido Nolte for his insightful feedback and informative discus-
sions.
References
H. Akaike. On the use of a linear model for the identification of feedback systems.
Annals of the Institute of Statistical Mathematics, 20(1):425–439, 1968.
P. E. Caines. Weak and strong feedback free processes. IEEE Transactions on Automatic
Control, 21:737–739, 1976.
T. Chu and C. Glymour. Search for additive nonlinear time series causal models. The
Journal of Machine Learning Research, 9:967–991, 2008.
R. A. Fisher. Statistical Methods for Research Workers. Macmillan Pub Co, 1925.
ISBN 0028447301.
I. Guyon. Time series analysis with the causality workbench. Journal of Ma-
chine Learning Research, Workshop and Conference Proceedings, XX. Time Series
Causality:XX–XX, 2011.
H. White and X. Lu. Granger Causality and Dynamic Structural Systems. Journal of
Financial Econometrics, 8(2):193, 2010.
A. Ziehe and K.-R. Mueller. TDSEP: an efficient algorithm for blind separation using
time structure. ICANN Proceedings, pages 675–680, 1998.
S. T. Ziliak and D. N. McCloskey. The Cult of Statistical Significance: How the Stan-
dard Error Costs Us Jobs, Justice, and Lives. University of Michigan Press, February
2008. ISBN 0472050079.
Appendix A. Statistical significance tables for Type I and Type III errors
In order to assist practitioners in evaluating the statistical significance of bivariate
causality testing, tables were prepared for type I and type III error probabilities, as
defined in (1), for different values of the base statistic. Tables are provided below
both for the jackknifed statistic Ψ/std(Ψ) and for the raw statistic Ψ, which is needed
in case the number of points is too low to allow a jackknife/cross-validation/bootstrap,
or when computational speed is at a premium. The spectral evaluation method is Welch's
method as described in Section 6. There were 2000 simulations for each condition. The
tables in this Appendix differ in one important aspect from those in the main text. In
order to avoid non-informative comparison of datasets which are, for example, analyses
of the same physical process sampled at different sampling rates, the number of points
is scaled to an 'effective' number of points, which is essentially the number of samples
relative to a simple estimate of the observed signal bandwidth:
$$N^{*} = N\tau_S \,/\, \widehat{BW}, \qquad \widehat{BW} = \frac{\|X\|_F}{\|\Delta X / \Delta T\|_F}$$
The values marked with an asterisk have values of both α and γ which are less than
5%. Note also that Ψ is a non-dimensional index.
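The effective number of points defined above can be computed directly. A sketch, under the assumption (as in the formula above) that the bandwidth estimate is the ratio of the signal norm to the difference-signal norm:

```python
import numpy as np

def effective_n(X, tau_s=1.0):
    """Effective number of points N* = N * tau_s / BW_hat, per Appendix A."""
    dX = np.diff(X, axis=0) / tau_s
    bw_hat = np.linalg.norm(X) / np.linalg.norm(dX)
    return X.shape[0] * tau_s / bw_hat

rng = np.random.default_rng(7)
white = rng.standard_normal((1000, 2))
smooth = np.cumsum(white, axis=0) * 0.05    # much slower signal, narrower bandwidth

print(effective_n(white) > effective_n(smooth))
```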
Table 6: α vs. Ψ
Table 7: γ vs. Ψ
JMLR: Workshop and Conference Proceedings 12:1–29
Abstract
The causal notions embodied in the concept of Granger causality have been argued
to belong to a different category than those of Judea Pearl’s Causal Model, and so
far their relation has remained obscure. Here, we demonstrate that these concepts
are in fact closely linked by showing how each relates to straightforward notions of
direct causality embodied in settable systems, an extension and refinement of the Pearl
Causal Model designed to accommodate optimization, equilibrium, and learning. We
then provide straightforward practical methods to test for direct causality using tests
for Granger causality.
Keywords: Causal Models, Conditional Exogeneity, Conditional Independence,
Granger Non-causality
1. Introduction
The causal notions embodied in the concept of Granger causality (“G−causality”) (e.g.,
Granger, C.W.J., 1969; Granger, C.W.J. and P. Newbold, 1986) are probabilistic, re-
lating to the ability of one time series to predict another, conditional on a given infor-
mation set. On the other hand, the causal notions of the Pearl Causal Model (“PCM”)
(e.g., Pearl, J., 2000) involve specific notions of interventions and of functional rather
than probabilistic dependence. The relation between these causal concepts has so far
remained obscure. For his part, Granger, C.W.J. (1969) acknowledged that G−causality
was not “true” causality, whatever that might be, but that it seemed likely to be an
important part of the full story. On the other hand, Pearl, J. (2000, p. 39) states that
“econometric concepts such as ‘Granger causality’ (Granger, C.W.J., 1969) and ‘strong
exogeneity’ (Engle, R., D. Hendry, and J.-F. Richard, 1983) will be classified as
statistical rather than causal.” In practice, especially in economics, numerous studies have
used G−causality either explicitly or implicitly to draw structural or policy conclusions,
but without any firm foundation.
Recently, White, H. and X. Lu (2010a, “WL”) have provided conditions under which
G−causality is equivalent to a form of direct causality arising naturally in dynamic
structural systems, defined in the context of settable systems. The settable systems
framework, introduced by White, H. and K. Chalak (2009, “WC”), extends and refines the
PCM to accommodate optimization, equilibrium, and learning. In this paper, we explore
the relations between direct structural causality in the settable systems framework
and notions of direct causality in the PCM for both recursive and non-recursive
systems. The close correspondence between these concepts in the recursive systems
relevant to G−causality then enables us to show that there is in fact a close linkage
between G−causality and PCM notions of direct causality. This enables us to provide
straightforward practical methods to test for direct causality using tests for Granger
causality.
In a related paper, Eichler, M. and V. Didelez (2009) also study the relation between
G−causality and interventional notions of causality. They give conditions under which
Granger non-causality implies that an intervention has no effect. In particular, Eich-
ler, M. and V. Didelez (2009) use graphical representations as in Eichler, M. (2007)
of given G−causality relations satisfying the “global Granger causal Markov prop-
erty" to provide graphical conditions for the identification of effects of interventions
in “stable" systems. Here, we pursue a different route for studying the interrelations
between G−causality and interventional notions of causality. Specifically, we see that
G−causality and certain settable systems notions of direct causality based on functional
dependence are equivalent under a conditional form of exogeneity. Our conditions are
alternative to “stability" and the “global Granger causal Markov property," although
particular aspects of our conditions have a similar flavor.
As a referee notes, the present work also provides a rigorous complement, in dis-
crete time, to work by other authors in this volume (for example Roebroeck, A., Seth,
A.K., and Valdes-Sosa, P., 2011) on combining structural and dynamic concepts of
causality.
The plan of the paper is as follows. In Section 2, we briefly review the PCM. In Sec-
tion 3, we motivate settable systems by discussing certain limitations of the PCM using
a series of examples involving optimization, equilibrium, and learning. We then specify
a formal version of settable systems that readily accommodates the challenges to causal
discourse presented by the examples of Section 3. In Section 4, we define direct struc-
tural causality for settable systems and relate this to corresponding notions in the PCM.
The close correspondence between these concepts in recursive systems establishes the
first step in linking G−causality and the PCM. In Section 5, we discuss how the results
of WL complete the chain by linking direct structural causality and G−causality. This
also involves a conditional form of exogeneity. Section 6 constructs convenient practical
tests for structural causality based on proposals of WL, using tests for G−causality and
conditional exogeneity. Section 7 contains a summary and concluding remarks.

Linking Granger Causality and the Pearl Causal Model with Settable Systems
The unique fixed point requirement is crucial to the PCM, as this is necessary for
defining the potential response function (Pearl, J., 2000, def. 7.1.4). This provides the
foundation for discourse about causal relations between endogenous variables; without
the potential response function, causal discourse is not possible in the PCM. A variant
of the PCM (Halpern, J., 2000) does not require a fixed point, but if any exist, there may
be multiple collections of functions g yielding a fixed point. We call this a Generalized
Pearl Causal Model (GPCM). As GPCMs also do not possess an analog of the potential
response function in the absence of a unique fixed point, causal discourse in the GPCM
is similarly restricted.
In presenting the PCM, we have adapted Pearl’s notation somewhat to facilitate
subsequent discussion, but all essential elements are present and complete.
Pearl, J. (2000) gives numerous examples for which the PCM is ideally suited for
supporting causal discourse. As a simple game-theoretic example, consider a market
in which there are exactly two firms producing similar but not identical products (e.g.,
Coke and Pepsi in the cola soft-drink market). Price determination in this market is a
two-player game known as “Bertrand duopoly."
In deciding its price, each firm maximizes its profit, taking into account the pre-
vailing cost and demand conditions it faces, as well as the price of its rival. A simple
system representing price determination in this market is
p1 = a1 + b1 p2
p2 = a2 + b2 p1 .
Here, p1 and p2 represent the prices chosen by firms 1 and 2 respectively, and a1 , b1 ,
a2 , and b2 embody the prevailing cost and demand conditions.
White Chalak Lu
We see that this maps directly to the PCM with n = 2, endogenous variables v =
(p1 , p2 ), background variables u = (a1 , b1 , a2 , b2 ), and structural functions
f1 (v2 , u) = a1 + b1 p2
f2 (v1 , u) = a2 + b2 p1 .
The fixed point of this system (unique provided b1 b2 ≠ 1) represents the Nash equilibrium for this two-player game.
Clearly, the PCM applies perfectly, supporting causal discourse for this Bertrand
duopoly game. Specifically, we see that p1 causes p2 and vice-versa, and that the effect
of p2 on p1 is b1 , whereas that of p1 on p2 is b2 .
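As a quick check, the unique fixed point can be computed in closed form by substituting one equation into the other (this requires b1 b2 ≠ 1). The sketch below uses illustrative parameter values and verifies that the solution satisfies both structural equations:

```python
def bertrand_fixed_point(a1, b1, a2, b2):
    # substitute p2 = a2 + b2*p1 into p1 = a1 + b1*p2 and solve (needs b1*b2 != 1)
    p1 = (a1 + b1 * a2) / (1 - b1 * b2)
    p2 = a2 + b2 * p1
    return p1, p2

# illustrative cost/demand parameters, not values from the text
p1, p2 = bertrand_fixed_point(10.0, 0.5, 8.0, 0.25)
# both structural equations hold at the fixed point:
ok = abs(p1 - (10.0 + 0.5 * p2)) < 1e-9 and abs(p2 - (8.0 + 0.25 * p1)) < 1e-9
```

Here the direct effects are exactly the coefficients b1 and b2, as stated above: perturbing p2 by one unit moves firm 1's price by b1.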
In fact, the PCM applies directly to a wide variety of games, provided that the game
has a unique equilibrium. But there are many important cases where there may be no
equilibrium or multiple equilibria. This limits the applicability of the PCM. We explore
examples of this below, as well as other features of the PCM that limit its applicability.
3. Settable Systems
3.1. Why Settable Systems?
WC motivate the development of the settable system (SS) framework as an extension of
the PCM that accommodates optimization, equilibrium, and learning, which are central
features of the explanatory structures of interest in economics. But these features are
of interest more broadly, especially in machine learning, as optimization corresponds
to any intelligent or rational behavior, whether artificial or natural; equilibrium (e.g.,
Nash equilibrium) or transitions toward equilibrium characterize stable interactions be-
tween multiple interacting systems; and learning corresponds to adaptation and evolu-
tion within and between interacting systems. Given the prevalence of these features in
natural and artificial systems, it is clearly desirable to provide means for explicit and
rigorous causal discourse relating to systems with these features.
To see why an extension of the PCM is needed to handle optimization, equilibrium,
and learning, we consider a series of examples that highlight certain limiting features
of the PCM: (i) in the absence of a unique fixed point, causal discourse is undefined;
(ii) background variables play no causal role; (iii) the role of attributes is restricted;
and (iv) only a finite rather than a countable number of units is permitted. WC discuss
further relevant aspects of the PCM, but these suffice for present purposes.
Example 3.1 (Equilibria in Game Theory) Our first example concerns general
two-player games, extending the discussion that we began above in considering Bertrand
duopoly.
Let two players i = 1, 2 have strategy sets S i and utility functions ui , such that πi =
ui (z1 , z2 ) gives player i’s payoff when player 1 plays z1 ∈ S 1 and player 2 plays z2 ∈ S 2 .
Each player solves the optimization problem
max_{zi ∈ S i} ui (z1 , z2 ).
The solution to this problem, when it exists, is player i’s best response, denoted
yi = rie (z(i) ; a),
where rie is player i’s best response function (the superscript “e” stands for “elementary,”
conforming to notation formally introduced below); z(i) denotes the strategy played by
the player other than i; and a := (S 1 , u1 , S 2 , u2 ) denotes given attributes defining the
game. For simplicity here, we focus on “pure strategy" games; see Gibbons, R. (1992)
for an accessible introduction to game theory.
Different configurations for a correspond to different games. For example, one of
the most widely known games is prisoner’s dilemma, where two suspects in a crime are
separated and offered a deal: if one confesses and the other does not, the confessor is
released and the other goes to jail. If both confess, both receive a mild punishment. If
neither confesses, both are released. The strategies are whether to confess or not. Each
player’s utility is determined by both players’ strategies and the punishment structure.
Another well known game is hide and seek. Here, player 1 wins by matching player
2’s strategy and player 2 wins by mismatching player 1’s strategy. A familiar example
is a penalty kick in soccer: the goalie wins by matching the direction (right or left) of
the kicker’s kick; the kicker wins by mismatching the direction of the goalie’s lunge.
The same structure applies to baseball (hitter vs. pitcher) or troop deployment in battle
(aggressor vs. defender).
A third famous game is battle of the sexes. In this game, Ralph and Alice are trying
to decide how to spend their weekly night out. Alice prefers the opera, and Ralph
prefers boxing; but both would rather be together than apart.
Now consider whether the PCM permits causal discourse in these games, e.g., about
the effect of one player’s action on that of the other. We begin by mapping the elements
of the game to the elements of the PCM. First, we see that a corresponds to PCM back-
ground variables u, as these are specified outside the system. The variables determined
within the system, i.e., the PCM endogenous variables are z := (z1 , z2 ) corresponding to
v, provided that (for now) we drop the distinction between yi and zi . Finally, we see that
the best response functions rie correspond to the PCM structural functions fi .
To determine whether the PCM permits causal discourse in these games, we can
check whether there is a unique fixed point for the best responses. In prisoner’s dilemma,
there is indeed a unique fixed point (both confess), provided the punishments are suit-
ably chosen. The PCM therefore applies to this game to support causal discourse. But
there is no fixed point for hide and seek, so the PCM cannot support causal discourse
there. On the other hand, there are two fixed points for battle of the sexes: both Ralph
and Alice choose opera or both choose boxing. The PCM does not support causal dis-
course there either. Nor does the GPCM apply to the latter games, because even though
it does not require a unique fixed point, the potential response functions required for
causal discourse are not defined.
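These fixed-point claims are easy to check by brute force. The sketch below enumerates pure-strategy profiles and counts fixed points of the joint best-response map; the payoff numbers are illustrative assumptions, since the text describes the games only qualitatively:

```python
from itertools import product

def pure_fixed_points(payoffs, strategies):
    # profiles (z1, z2) with z1 a best response to z2 and z2 a best response to z1
    def best(i, other):
        def pay(zi):
            return payoffs[i][(zi, other) if i == 0 else (other, zi)]
        top = max(pay(z) for z in strategies)
        return {z for z in strategies if pay(z) == top}
    return [(z1, z2) for z1, z2 in product(strategies, repeat=2)
            if z1 in best(0, z2) and z2 in best(1, z1)]

# prisoner's dilemma: confess (C) or deny (D); confessing is dominant
pd0 = {("C", "C"): 1, ("C", "D"): 3, ("D", "C"): 0, ("D", "D"): 2}
pd1 = {(a, b): v for (b, a), v in pd0.items()}        # symmetric game
# hide and seek: player 1 wins by matching, player 2 by mismatching
hs0 = {(a, b): int(a == b) for a in "LR" for b in "LR"}
hs1 = {(a, b): int(a != b) for a in "LR" for b in "LR"}
# battle of the sexes: prefer being together; Alice prefers opera (O), Ralph boxing (B)
bs0 = {("O", "O"): 2, ("B", "B"): 1, ("O", "B"): 0, ("B", "O"): 0}
bs1 = {("O", "O"): 1, ("B", "B"): 2, ("O", "B"): 0, ("B", "O"): 0}

counts = (len(pure_fixed_points([pd0, pd1], ["C", "D"])),
          len(pure_fixed_points([hs0, hs1], list("LR"))),
          len(pure_fixed_points([bs0, bs1], ["O", "B"])))
```

The counts come out as one fixed point for prisoner's dilemma, none for hide and seek, and two for battle of the sexes, matching the discussion above.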
The importance of game theory in describing the outcomes of interactions of
goal-seeking agents, together with the fact that the unique fixed point requirement
prohibits the PCM from supporting causal discourse in important cases, strongly
motivates formulating a causal framework that drops this requirement. As we discuss below, the SS
framework does not require a unique fixed point, and it applies readily to games gener-
ally. Moreover, recognizing and enforcing the distinction between yi (i’s best response
strategy) and zi (an arbitrary setting of i’s strategy) turns out to be an important
component in eliminating this requirement.
Another noteworthy aspect of this example is that a is a fixed list of elements that
define the game. Although elements of a may differ across players, they do not vary for
a given player. This distinction should be kept in mind when referring to the elements
of a as background “variables."
Example 3.2 (Consumer Demand) Consider a consumer who chooses quantities of two
goods to solve
max_{z1 , z2} U(z1 , z2 ) subject to m = z1 + pz2 ,
where z1 and z2 represent quantities consumed of goods 1 (beer) and 2 (pizza) respectively
and U is the utility function that embodies the consumer’s preferences for the
two goods. For simplicity, let the price of a beer be $1, and let p represent the price of
pizza; m represents funds available for expenditure, “income” for short.1 The budget
constraint m = z1 + pz2 ensures that total expenditure on beer and pizza does not exceed
income (no borrowing) and also that total expenditure on beer and pizza is not less than
m. (As long as utility is increasing in consumption of the goods, it is never optimal to
expend less than the funds available.)
Solving the consumer’s demand problem leads to the optimal consumer demands
for beer and pizza, y1 and y2 . It is easy to show that these can be represented as
y1 = r1a (p, m; a)
y2 = r2a (p, m; a),
where r1a and r2a are known as the consumer’s market demand functions for beer and
pizza. The “a" superscript stands for “agent," corresponding to notation formally intro-
duced below. The attributes a include the consumer’s utility function U (preferences)
and the admissible values for z1 , z2 , p, and m, e.g., R+ := [0, ∞).
Now consider how this problem maps to the PCM. First, we see that a and (p, m)
correspond to the background variables u, as these are not determined within the sys-
tem. Next, we see that y := (y1 , y2 ) corresponds to PCM endogenous variables v. Finally,
1. Since a beer costs a dollar, it is the “numeraire," implying that income is measured in units of beer.
This is a convenient convention ensuring that we only need to keep track of the price ratio between
pizza and beer, p, rather than their two separate prices.
we see that the consumer demand functions ria correspond to the PCM structural func-
tions fi . Also, because the demand for beer, y1 , does not enter the demand function
for pizza, r2a , and vice versa, there is a unique fixed point for this system of equations.
Thus, the PCM supports causal discourse in this system.
Nevertheless, this system is one where, in the PCM, the causal discourse natural
to economists is unavailable. Specifically, economists find it natural to refer to “price
effects" and “income effects" on demand, implicitly or explicitly viewing price p and
income m as causal drivers of demand. For example, the pizza demand price effect
is (∂/∂p)r2a (p, m; a). This represents how much optimal pizza consumption (demand)
will change as a result of a small (marginal) increase in the price of pizza. Similarly,
the pizza demand income effect is (∂/∂m)r2a (p, m; a), representing how much optimal
pizza consumption will change as a result of a small increase in income. But in the
PCM, causal discourse is reserved only for endogenous variables y1 and y2 . The fact
that background variables p and m do not have causal status prohibits speaking about
their effects.
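The chapter leaves the demand functions r1a and r2a abstract. Purely as an illustrative assumption, take Cobb-Douglas utility U(z1 , z2 ) = z1^α z2^(1−α), which yields demands y1 = αm and y2 = (1 − α)m/p. The sketch below checks the pizza-demand price and income effects numerically against their analytic values:

```python
ALPHA = 0.4  # assumed preference parameter (hypothetical, for illustration)

def r2a(p, m):
    # pizza demand under the assumed Cobb-Douglas utility: y2 = (1 - alpha)*m/p
    return (1 - ALPHA) * m / p

def price_effect(p, m, h=1e-6):
    # (d/dp) r2a(p, m; a), by central finite differences
    return (r2a(p + h, m) - r2a(p - h, m)) / (2 * h)

def income_effect(p, m, h=1e-6):
    # (d/dm) r2a(p, m; a)
    return (r2a(p, m + h) - r2a(p, m - h)) / (2 * h)

# analytic values: -(1 - alpha)*m/p**2 and (1 - alpha)/p
pe = price_effect(2.0, 10.0)    # about -1.5
ie = income_effect(2.0, 10.0)   # about  0.3
```

These are exactly the effects economists want to discuss, yet in the PCM mapping above p and m sit among the background variables and have no causal status.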
Observe that the “endogenous" status of y and “exogenous" status of p and m is
determined in SS by utility maximization, the “governing principle" here. In contrast,
there is no formal mechanism in the PCM that permits making these distinctions. Al-
though causal discourse in the PCM can be rescued for such systems by “endogenizing"
p and m, that is, by positing additional structure that explains the genesis of p and m
in terms of further background variables, this is unduly cumbersome. It is much more
natural simply to permit p and m to have causal status from the outset, so that price and
income effects are immediately meaningful, without having to specify their determin-
ing processes. The SS framework embodies this direct approach. Those familiar with
theories of price and income determination will appreciate the considerable complica-
tions avoided in this way. The same simplifications occur with respect to the primitive
variables appearing in any responses determined by optimizing behavior.
Also noteworthy here is the important distinction between a, which represents fixed
attributes of the system, and p and m, which are true variables that can each take a range
of different possible values. As WC (p.1774) note, restricting the role of attributes
by “lumping together" attributes and structurally exogenous variables as background
objects without causal status creates difficulties for causal discourse in the PCM:
[this] misses the opportunity to make an important distinction between
invariant aspects of the system units on the one hand and counterfactual
variation admissible for the system unit values on the other. Among other
things, assigning attributes to u interferes with assigning natural causal
roles to structurally exogenous variables.
By distinguishing between attributes and structurally exogenous variables, settable sys-
tems permit causal status for variables determined outside a given system, such as when
price and income drive consumer demand.
Example 3.3 (Structural VAR) Consider the bivariate structural vector autoregression
y1,t = a11 y1,t−1 + a12 y2,t−1 + u1,t
y2,t = a21 y1,t−1 + a22 y2,t−1 + u2,t , t = 1, 2, . . . ,
where y1,0 and y2,0 are given scalars, a := (a11 , a12 , a21 , a22 )' is a given real “coefficient”
vector, and {ut := (u1,t , u2,t ) : t = 1, 2, . . .} is a given sequence. This system describes the
evolution of {yt := (y1,t , y2,t ) : t = 1, 2, . . .} through time.
Now consider how this maps to the PCM. We see that y0 := (y1,0 , y2,0 ), {ut }, and a
correspond to the PCM background variables u, as these are not determined within the
system. Further, we see that the sequence {yt } corresponds to the endogenous variables
v, and that the PCM structural functions fi correspond to
r1,t (yt−1 , ut ; a) := a11 y1,t−1 + a12 y2,t−1 + u1,t
r2,t (yt−1 , ut ; a) := a21 y1,t−1 + a22 y2,t−1 + u2,t ,
where yt−1 := (y0 , . . . , yt−1 ) and ut := (u1 , . . . , ut ) represent finite “histories” of the
indicated variables. We also see that this system is recursive, and therefore has a unique
fixed point.
The challenge to the PCM here is that it permits only a finite rather than a countable
number of units: both the number of background variables (m) and endogenous vari-
ables (n) must be finite in the PCM, whereas the structural VAR requires a countable
infinity of background and endogenous variables. In contrast, settable systems per-
mit (but do not require) a countable infinity of units, readily accommodating structural
VARs.
In line with our previous discussion, settable systems distinguish between system
attributes a (a fixed vector) and structurally exogenous causal variables y0 and {ut }. The
difference in the roles of y0 and {ut } on the one hand and a on the other are particularly
clear in this example. In the PCM, these are lumped together as background variables
devoid of causal status. Since a is fixed, its lack of causal status is appropriate; indeed,
a represents effects here,2 not causes. But the lack of causal status is problematic for
the variables y0 and {ut }; for example, this prohibits discussing the effects of structural
“shocks" ut .
Observe that the structural VAR represents u1,t as a causal driver of y1,t , as is stan-
dard. Nevertheless, settable systems do not admit “instantaneous" causation, so even
though u1,t has the same time index as y1,t , i.e. t, we adopt the convention that u1,t is
realized prior to y1,t . That is, there must be some positive time interval δ > 0, no matter
how small, separating these realizations. For example, δ can represent the amount of
time it takes to compute y1,t once all its determinants are in place. Strictly speaking,
then, we could write u1,t−δ in place of u1,t , but for notational convenience, we leave
this implicit. We refer to this as “contemporaneous" causation to distinguish it from
instantaneous causation.
2. For example, (∂/∂y1,t−1 )r1,t (yt−1 , ut ; a) = a11 can be interpreted as the marginal effect of y1,t−1 on y1,t .
A common focus of interest when applying structural VARs is to learn the coef-
ficient vector a. In applications, it is typically assumed that the realizations {yt } are
observed, whereas {ut } is unobserved. The least squares estimator for a sample of size
T , say âT , is commonly used to learn (estimate) a in such cases. This estimator is a
straightforward function of yT , say âT = ra,T (yT ). If {ut } is generated as a realization
of a sequence of mean zero finite variance independent identically distributed (IID)
random variables, then âT generally converges to a with probability one as T → ∞,
implying that a can be fully learned in the limit. Viewing âT as causally determined by
yT , we see that we require a countable number of units to treat this learning problem.
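The learning step is easy to simulate. The sketch below uses illustrative coefficient values and IID standard normal shocks (an assumption; the text requires only mean zero, finite variance, IID), and recovers the coefficient matrix by least squares, written in pure Python so the sketch stays self-contained:

```python
import random

def simulate_var(A, y0, T, seed=0):
    # y_t = A y_{t-1} + u_t with IID standard normal shocks u_t
    rng = random.Random(seed)
    ys = [y0]
    for _ in range(T):
        y1, y2 = ys[-1]
        ys.append((A[0][0] * y1 + A[0][1] * y2 + rng.gauss(0, 1),
                   A[1][0] * y1 + A[1][1] * y2 + rng.gauss(0, 1)))
    return ys

def ols2(xs, ys):
    # regress scalar ys on 2-vectors xs (no intercept): solve the 2x2 normal equations
    s11 = sum(x[0] * x[0] for x in xs)
    s12 = sum(x[0] * x[1] for x in xs)
    s22 = sum(x[1] * x[1] for x in xs)
    t1 = sum(x[0] * y for x, y in zip(xs, ys))
    t2 = sum(x[1] * y for x, y in zip(xs, ys))
    det = s11 * s22 - s12 * s12
    return [(s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det]

def estimate_A(ys):
    # a_hat_T = r_{a,T}(y^T): least squares rows, regressing y_t on y_{t-1}
    xs = ys[:-1]
    return [ols2(xs, [y[0] for y in ys[1:]]),
            ols2(xs, [y[1] for y in ys[1:]])]

A = [[0.5, 0.1], [0.2, 0.3]]    # assumed stable coefficient matrix
A_hat = estimate_A(simulate_var(A, (0.0, 0.0), 20000))
```

With T = 20000 observations, every entry of A_hat lies close to the corresponding entry of A, illustrating how a is learned in the limit.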
As these examples demonstrate, the PCM exhibits a number of features that limit
its applicability to systems involving optimization, equilibrium, and learning. These
limitations motivate a variety of features of settable systems, extending the PCM in
ways that permit straightforward treatment of such systems. We now turn to a more
complete description of the SS framework.
The principal unit i = 0 also plays a key role. We let the principal setting Z0 and
principal response Y0 of the principal settable variable X0 be such that Z0 : Ω0 → Ω0
is the identity map, Z0 (ω0 ) := ω0 , and we define Y0 (ω) := Z0 (ω0 ). The setting Z0 of
the principal settable variable may directly influence all other responses in the system,
whereas its response Y0 is unaffected by other settings. Thus, X0 supports introducing
an aspect of “pure randomness” to responses of settable variables.
Definition 3.1 (Elementary Settable System) Let A be a set and let attributes a ∈ A
be given. Let n ∈ N̄+ be given, and let (Ω, F , Pa ) be a complete probability space such
that Ω := ×_{i=0}^{n} Ωi , with each Ωi a copy of the principal space Ω0 , containing at least
two elements.
Let the principal setting Z0 : Ω0 → Ω0 be the identity mapping. For i = 1, 2, . . . , n,
let Si be a multi-element Borel-measurable subset of R and let settings Zi : Ωi → Si be
surjective measurable functions. Let Z(i) be the vector including every setting except Zi
and taking values in S(i) ⊆ Ω0 × (×_{j≠i} S j ), S(i) ≠ ∅. Let response functions ri ( · ; a) : S(i) → Si
be measurable functions and define responses Yi (ω) := ri (Z(i) (ω); a). Define settable
variables Xi : {0, 1} × Ω → Si as
Xi (0, ω) := Yi (ω) and Xi (1, ω) := Zi (ωi ), ω ∈ Ω.
Define Y0 and X0 by Y0 (ω) := X0 (0, ω) := X0 (1, ω) := Z0 (ω0 ), ω ∈ Ω.
Put X := {X0 , X1 , . . .}. The triple S := {(A, a), (Ω, F , Pa ), X} is an elementary set-
table system.
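The dual role of settable variables can be sketched in code. The toy implementation below is a simplifying assumption in several respects (finite, deterministic, with the probability space and measurability requirements of Definition 3.1 omitted), but it captures the response/setting distinction:

```python
class ElementarySettableSystem:
    """Toy finite version of Definition 3.1: each unit i has a response
    function r_i taking the settings of all other units and attributes a."""

    def __init__(self, attributes, response_fns):
        self.a = attributes
        self.r = response_fns           # dict: unit index -> response function

    def response(self, i, settings):
        # Y_i := r_i(Z_{(i)}; a), where Z_{(i)} excludes unit i's own setting
        z_others = {j: v for j, v in settings.items() if j != i}
        return self.r[i](z_others, self.a)

    def settable(self, i, role, settings):
        # X_i(0, .) is the response Y_i; X_i(1, .) is the setting Z_i
        return self.response(i, settings) if role == 0 else settings[i]

# Bertrand duopoly as an elementary settable system (illustrative parameters):
# each firm's price responds to a *setting* of the other firm's price
s = ElementarySettableSystem(
    {"a1": 10.0, "b1": 0.5, "a2": 8.0, "b2": 0.25},
    {1: lambda z, a: a["a1"] + a["b1"] * z[2],
     2: lambda z, a: a["a2"] + a["b2"] * z[1]})
```

For instance, `s.response(1, {2: 12.0})` evaluates firm 1's best-response price when firm 2's price is set to 12, while `s.settable(2, 1, {2: 7.0})` simply returns the setting 7.0, reflecting the two roles of a settable variable.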
Definition 3.2 (Partitioned Settable System) Let (A, a), (Ω, F , Pa ), X0 , n, and Si , i =
1, . . . , n, be as in Definition 3.1. Let Π = {Πb } be a partition of {1, . . . , n}, with cardinality
B ∈ N̄+ (B := #Π).
For i = 1, 2, . . . , n, let ZiΠ be settings and let Z(b)Π be the vector containing Z0 and
ZiΠ , i ∉ Πb , taking values in S(b)Π ⊆ Ω0 × (×_{i∉Πb} Si ), S(b)Π ≠ ∅, b = 1, . . . , B. For
b = 1, . . . , B and i ∈ Πb , suppose there exist measurable functions riΠ ( · ; a) : S(b)Π → Si ,
specific to Π, such that responses YiΠ (ω) are jointly determined as
YiΠ (ω) := riΠ (Z(b)Π (ω); a), i ∈ Πb , b = 1, . . . , B.
Define settable variables XiΠ : {0, 1} × Ω → Si as
XiΠ (0, ω) := YiΠ (ω) and XiΠ (1, ω) := ZiΠ (ωi ), ω ∈ Ω.
Put XΠ := {X0 , X1Π , X2Π , . . .}. The triple S := {(A, a), (Ω, F ), (Π, XΠ )} is a partitioned
settable system.
The settings Z(b)Π may be partition-specific; this is especially relevant when the
admissible set S(b)Π imposes restrictions on the admissible values of Z(b)Π . Crucially,
response functions and responses are partition-specific. In Definition 3.2, the joint
response function r[b]Π := (riΠ , i ∈ Πb ) specifies how the settings Z(b)Π outside of block
Πb determine the joint response Y[b]Π := (YiΠ , i ∈ Πb ), i.e., Y[b]Π = r[b]Π (Z(b)Π ; a). For
convenience below, we let Π0 = {0} represent the block corresponding to X0 .
Example 3.2 makes use of partitioning. Here, we have n = 4 settable variables with
B = 2 blocks. Let settable variables 1 and 2 correspond to beer and pizza consumption,
respectively, and let settable variables 3 and 4 correspond to price and income. The
agent partition groups together all variables under the control of a given agent. Let the
consumer be agent 2, so Π2 = {1, 2}. Let the rest of the economy, determining price and
income, be agent 1, so Π1 = {3, 4}. The agent partition is Πa = {Π1 , Π2 }. Then for block
2,
y1 = Y1a (ω) = r1a (Z0 (ω0 ), Z3a (ω3 ), Z4a (ω4 ); a) = r1a (p, m; a)
y2 = Y2a (ω) = r2a (Z0 (ω0 ), Z3a (ω3 ), Z4a (ω4 ); a) = r2a (p, m; a)
represents the joint demand for beer and pizza (belonging to block 2) as a function of
settings of price and income (belonging to block 1). This joint demand is unique under
y3 = Y3a (ω) = r3a (Z0 (ω0 ), Z1a (ω1 ), Z2a (ω2 ); a) = r3a (z0 ; a)
y4 = Y4a (ω) = r4a (Z0 (ω0 ), Z1a (ω1 ), Z2a (ω2 ); a) = r4a (z0 ; a).
In this example, price and income are not determined by the individual consumer’s
demands, so although Z1a (ω1 ) and Z2a (ω2 ) formally appear as allowed arguments of ria
after the second equality, we suppress these in writing ria (z0 ; a), i = 3, 4. Here, price
and income responses (belonging to block 1) are determined solely by block 0 settings
z0 = Z0 (ω0 ) = ω0 . This permits price and income responses to be randomly distributed,
under the control of Pa .
It is especially instructive to consider the elementary partition for this example,
Πe = {{1}, {2}, {3}, {4}}, so that Πi = {i}, i = 1, . . . , 4. The elementary partition specifies
how each system variable freely responds to settings of all other system variables. In
particular, it is easy to verify that when consumption of pizza is set to a given level,
the consumer’s optimal response is to spend whatever income is left on beer, and vice
versa. Thus,
y1 = r1e (Z0 (ω0 ), Z2e (ω2 ), Z3e (ω3 ), Z4e (ω4 ); a) = r1e (z2 , p, m; a) = m − pz2
y2 = r2e (Z0 (ω0 ), Z1e (ω1 ), Z3e (ω3 ), Z4e (ω4 ); a) = r2e (z1 , p, m; a) = (m − z1 )/p.
Replacing (y1 , y2 ) with (z1 , z2 ), we see that this system does not have a unique fixed
point, as any (z1 , z2 ) such that m = z1 + pz2 satisfies both equations.
Causal discourse in the PCM is ruled out by the lack of a unique fixed point. Nevertheless,
the settable systems framework supports the natural economic causal discourse here
about effects of prices, income, and, e.g., pizza consumption on beer demand. Further,
in settable systems, the governing principle of optimization (embedded in a) ensures
that the response functions for both the agent partition and the elementary partition are
mutually consistent.
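The continuum of fixed points is easy to verify numerically. The sketch below (illustrative values of p and m) checks that every budget-exhausting pair is a fixed point of the elementary-partition best responses, while an off-budget pair is not:

```python
def r1e(z2, p, m):
    # optimal beer consumption when pizza is set to z2: spend the rest on beer
    return m - p * z2

def r2e(z1, p, m):
    # optimal pizza consumption when beer is set to z1
    return (m - z1) / p

def is_fixed_point(z1, z2, p, m, tol=1e-9):
    return abs(r1e(z2, p, m) - z1) < tol and abs(r2e(z1, p, m) - z2) < tol

p, m = 2.0, 10.0
on_budget = [(m - p * z2, z2) for z2 in (0.0, 1.0, 2.5, 5.0)]   # m = z1 + p*z2
all_fixed = all(is_fixed_point(z1, z2, p, m) for z1, z2 in on_budget)
off_budget_fixed = is_fixed_point(5.0, 1.0, p, m)               # 5 + 2*1 != 10
```

Every point on the budget line is a fixed point, so no unique fixed point exists, which is precisely why the PCM cannot support causal discourse here.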
Without loss of generality, we can represent canonical responses and settings solely as
a function of ω0 , so that
Z[b]Π (ω0 ) = Y[b]Π (ω0 ) := r[b]Π (Z[0:b−1]Π (ω0 ); a), b = 1, . . . , B.
The canonical representation drops the distinction between settings and responses; we
write
Y[b]Π = r[b]Π (Y[0:b−1]Π ; a), b = 1, . . . , B.
It is easy to see that the structural VAR of Example 3.3 corresponds to the canoni-
cal representation of a canonical settable system. The canonical responses y0 and {ut }
belong to the first block, and canonical responses yt = (y1,t , y2,t ) belong to block t + 1,
t = 1, 2, . . . Example 3.3 implements the time partition, where joint responses for a given
time period depend on previous settings.
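The canonical representation suggests a simple evaluation scheme: blocks are computed in order, each joint response a function of the responses of strictly earlier blocks. A deterministic sketch with the shocks set to zero (block functions and coefficient values are illustrative):

```python
def canonical_responses(block_fns, y0, a):
    # evaluate Y_[b] = r_[b](Y_[0:b-1]; a) block by block, b = 1, ..., B
    history = [y0]                       # block 0: exogenous responses
    for r_b in block_fns:
        history.append(r_b(history, a))
    return history

# time partition for a bivariate VAR with shocks set to zero: block t+1 maps
# the previous period's responses through the coefficient vector a
def var_block(history, a):
    y1, y2 = history[-1]
    return (a[0] * y1 + a[1] * y2, a[2] * y1 + a[3] * y2)

path = canonical_responses([var_block] * 3, (1.0, 2.0), (0.5, 0.1, 0.2, 0.3))
```

Each block here depends only on the immediately preceding block, but the signature deliberately passes the whole history, matching Y_[b] = r_[b](Y_[0:b−1]; a).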
Definition 4.1 (Direct Causality) Let S be a partitioned settable system. For given
positive integer b, let j ∈ Πb . (i) For given i ∉ Πb , Xi directly causes X j in S if there
exists an admissible intervention z(b) → z∗(b);i such that
r j (z∗(b);i ; a) − r j (z(b) ; a) ≠ 0,
and we write Xi ⇒_S^D X j . Otherwise, we say Xi does not directly cause X j in S and
write Xi ⇏_S^D X j . (ii) For i, j ∈ Πb , Xi ⇏_S^D X j .
Pearl, J. (2000, 2001). To distinguish the settable system direct causality concept
from Pearl’s notion and later from Granger causality, we follow WL and refer to di-
rect causality in settable systems as direct structural causality.
X is a direct cause of Y if there exist two values x and x' of X and a value
u of U such that Yxr (u) ≠ Yx'r (u), where r is some realization of V\{X, Y}.
To make this statement fully meaningful requires applying Pearl’s (2000) definitions
7.1.2 (Submodel) and 7.1.4 (Potential Response) to arrive at the potential response,
Y xr (u). For brevity, we do not reproduce Pearl’s definitions here. Instead, it suffices
to map Y xr (u) and its elements to their settable system counterparts. Specifically, u
corresponds to (a, z0 ); x corresponds to zi ; r corresponds to the elements of z(b) other
than z0 and zi , say z(b)(i,0) ; and, provided it exists, Y xr (u) corresponds to r j (z(b) ; a).
The caveat about the existence of Y xr (u) is significant, as Y xr (u) is not defined in
the absence of a unique fixed point for the system. Further, even with a unique fixed
point, the potential response Y xr (u) must also uniquely solve a set of equations denoted
F x (see Pearl, J., 2000, eq. (7.1)) for a submodel, and there is no general guarantee of
such a solution. Fortunately, however, this caveat matters only for non-recursive PCMs.
In the recursive case relevant for G−causality, the potential response is generally well
defined.
Making a final identification between x' and z∗i , and given the existence of potential
responses Yx'r (u) and Yxr (u), we see that Yx'r (u) ≠ Yxr (u) corresponds to the settable
systems requirement r j (z∗(b);i ; a) − r j (z(b) ; a) ≠ 0.
Pearl, J. (2001, definition 1) gives a formal statement of the notion stated above,
saying that if for given u and some r, x, and x' we have Yxr (u) ≠ Yx'r (u) then X has a
controlled direct effect on Y in model M and situation U = u. In definition 2, Pearl,
J. (2001) labels Yx'r (u) − Yxr (u) the controlled direct effect, corresponding to the direct
structural effect r j (z∗(b);i ; a) − r j (z(b) ; a) defined for settable systems.
Thus, although there are important differences, especially in non-recursive systems,
the settable systems and PCM notions of direct causality and direct effects closely cor-
respond in recursive systems. These differences are sufficiently modest that the results
of WL linking direct structural causality to Granger causality, discussed next, also serve
to closely link the PCM notion of direct cause to that of Granger causality.
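In recursive systems, these quantities are directly computable. A minimal sketch follows; the structural function g and its linear form are assumptions for illustration, not the chapter's model:

```python
def potential_response(g, x, r, u):
    # Y_xr(u): hold X at x and the remaining parents R at r, then evaluate Y
    return g(x, r, u)

def controlled_direct_effect(g, x, x_alt, r, u):
    # Pearl's controlled direct effect Y_{x'r}(u) - Y_{xr}(u), which corresponds
    # to the settable-systems direct structural effect r_j(z*_(b);i; a) - r_j(z_(b); a)
    return potential_response(g, x_alt, r, u) - potential_response(g, x, r, u)

# illustrative recursive structural function for Y with parents X, R and background U
g = lambda x, r, u: 2.0 * x + 3.0 * r + u
effect = controlled_direct_effect(g, 0.0, 1.0, r=5.0, u=-1.0)   # = 2.0
```

In this linear example the controlled direct effect equals the coefficient on x regardless of r and u; in general it may depend on both, which is why the definitions quantify over them.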
Here, we focus only on the k = 0 case, as this is what relates generally to structural
causality.
As Florens, J.P. and M. Mouchart (1982) and Florens, J.P. and D. Fougère (1996)
note, G non-causality is a form of conditional independence. Following Dawid (1979),
we write X ⊥ Y | Z when X and Y are independent given Z. Translating (2) gives the
following version of the classical definition of Granger causality:
Then we say Q does not finite-order G−cause Y with respect to S . Otherwise, we say
Q finite-order G−causes Y with respect to S .
Yt = qt (Y t−1 , Z t , Ut ), t = 1, 2, . . . , (4)
such that, with Yt := (Y1,t′ , Y2,t′ )′ and Ut := (U1,t′ , U2,t′ )′ ,
Such structures are well suited to representing the structural evolution of time-series
data in economic, biological, or other systems. Because Yt is a vector, this covers the
case of panel data, where one has a cross-section of time-series observations, as in fMRI
or EEG data sets. For practical relevance, we explicitly impose the Markov assumption
that Yt is determined by only a finite number of its own lags and those of Zt and Ut . WL
discuss the general case.
Throughout, we suppose that realizations of Wt , Yt , and Zt are observed, whereas
realizations of Ut are not. Because Ut , Wt , or Zt may have dimension zero, their pres-
ence is optional. Usually, however, some or all will be present. Since there may be a
countable infinity of unobservables, there is no loss of generality in specifying that Yt
depends only on Ut rather than on a finite history of lags of Ut .
This structure is general: the structural relations may be nonlinear and non-
monotonic in their arguments and non-separable between observables and unobserv-
ables. This system may generate stationary processes, non-stationary processes, or
both. The system of Assumption A.1 is thus a general structural VAR; Example 3.3 is a special case.
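To make this generality concrete, here is a hypothetical instance of a structure of the form (4) with one lag (m = 1): the first response is nonlinear, non-monotonic, and non-separable between observables and its unobserved driver. All functional forms and coefficients are illustrative inventions, not taken from WL.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200
U = rng.standard_normal((T, 2))   # unobserved drivers U_{1,t}, U_{2,t}
Z = rng.standard_normal(T)        # observed ancillary cause Z_t
Y = np.zeros((T, 2))              # responses Y_{1,t}, Y_{2,t}

for t in range(1, T):
    # q_{1,t}: nonlinear in the lags and non-separable in U_{1,t}
    Y[t, 0] = (np.tanh(0.6 * Y[t - 1, 0] + 0.8 * Y[t - 1, 1])
               * (1.0 + 0.5 * U[t, 0]) + 0.3 * Z[t])
    # q_{2,t}: non-monotonic in Z_t, driven by its own unobservable
    Y[t, 1] = 0.5 * Y[t - 1, 1] + np.sin(Z[t]) + 0.4 * U[t, 1]
```

In this toy system the lag of Y2 directly structurally causes Y1 (the first equation is not constant in Y[t−1, 1]), which is exactly the kind of relation the G−causality tests below aim to detect.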
The vector Yt represents responses of interest. Consistent with a main application
of G−causality, our interest here attaches to the effects on Y1,t of the lags of Y2,t . We
thus call Y2,t−1 and its further lags “causes of interest.” Note that A.1 specifies that Y1,t
and Y2,t each have their own unobserved drivers, U1,t and U2,t , as is standard.
The vectors Ut and Zt contain causal drivers of Yt whose effects are not of primary
interest; we thus call Ut and Zt “ancillary causes.” The vector Wt may contain responses
to Ut . Observe that Wt does not appear in the argument list for qt , so it explicitly
does not directly determine Yt . Note also that Yt ⇐ (Y t−1 , U t , W t , Z t ) ensures that Wt is
not determined by Yt or its lags. A useful convention is that Wt ⇐ (W t−1 , U t , Z t ), so
that Wt does not drive unobservables. If a structure does not have this property, then
suitable substitutions can usually yield a derived structure satisfying this convention.
Nevertheless, we do not require this, so Wt may also contain drivers of unobservable
causes of Yt .
For concreteness, we now specialize the settable systems definition of direct struc-
tural causality (Definition 4.1) to the specific system given in A.1. For this, let y s,t−1 be
the sub-vector of yt−1 with elements indexed by the non-empty set s ⊆ {1, . . . , ky } × {t −
m, . . . , t − 1}, and let y(s),t−1 be the sub-vector of yt−1 with elements of s excluded.
Definition 5.3 (Direct Structural Causality) Given A.1, for given t > 0, j ∈ {1, . . . , ky },
and s, suppose that for all admissible values of y(s),t−1 , zt , and ut , the function y s,t−1 →
q j,t (yt−1 , zt , ut ) is constant in y s,t−1 . Then we say Y s,t−1 does not directly structurally
cause Y j,t and write Y s,t−1 ⇏ᴰS Y j,t . Otherwise, we say Y s,t−1 directly structurally
causes Y j,t and write Y s,t−1 ⇒ᴰS Y j,t .
We can similarly define direct causality or non-causality of Z s,t or U s,t for Y j,t , but
we leave this implicit. We write, e.g., Y s,t−1 ⇒ᴰS Yt when Y s,t−1 ⇒ᴰS Y j,t for some
j ∈ {1, . . . , ky }.
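Definition 5.3 is a statement about functional dependence: Y s,t−1 is a direct structural cause exactly when q j,t actually varies with y s,t−1 somewhere on the admissible set. A crude numerical probe of this idea follows; it is a toy check over a finite grid, which can only approximate the “for all admissible values" quantifier, and the structural functions and values are hypothetical.

```python
import numpy as np

def varies_in(q, arg_index, grid, base_args):
    """Return True if q's output changes as argument `arg_index` ranges over
    `grid` with the remaining arguments fixed at `base_args`, i.e. q is
    NOT constant in that argument at this base point."""
    outputs = set()
    for v in grid:
        args = list(base_args)
        args[arg_index] = v
        outputs.add(round(float(q(*args)), 12))
    return len(outputs) > 1

# Hypothetical structural functions q_{1,t}(y1_lag, y2_lag, z, u):
q_cause = lambda y1, y2, z, u: np.tanh(y1 + 0.5 * y2) + z * u
q_nocause = lambda y1, y2, z, u: np.tanh(y1) + z * u      # constant in y2

grid = [-1.0, 0.0, 1.0]
base = (0.2, 0.0, 0.5, 0.1)
print(varies_in(q_cause, 1, grid, base))    # True: the y2 lag is a direct cause here
print(varies_in(q_nocause, 1, grid, base))  # False: no dependence on the y2 lag
```

A single base point can miss dependence that appears elsewhere, which is precisely why Definition 5.3 quantifies over all admissible values.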
Building on work of White, H. (2006a) and White, H. and K. Chalak (2009), WL
discuss how certain exogeneity restrictions permit identification of expected causal effects
in dynamic structures. Our next result shows that a specific form of exogeneity
enables us to link direct structural causality and finite order G−causality. To state this
exogeneity condition, we write Y 1,t−1 := (Y1,t−τ1 , . . . , Y1,t−1 ), Y 2,t−1 := (Y2,t−τ1 , . . . , Y2,t−1 ),
and, for given τ1 , τ2 ≥ 0, Xt := (Xt−τ1 , . . . , Xt+τ2 ), where Xt := (Wt' , Zt' )'.
Assumption A.2 For ℓ and m as in A.1 and for τ1 ≥ m, τ2 ≥ 0, suppose that Y 2,t−1 ⊥
U1,t | (Y 1,t−1 , Xt ), t = 1, . . . , T − τ2 .
The classical strict exogeneity condition specifies that (Y t−1 , Z t ) ⊥ U1,t , which implies
Y 2,t−1 ⊥ U1,t | (Y 1,t−1 , Z t ). (Here, Wt can be omitted.) Assumption A.2 is a weaker
requirement, as it may hold when strict exogeneity fails. Because of the conditioning
involved, we call this conditional exogeneity. Chalak, K. and H. White (2010) discuss
structural restrictions for canonical settable systems that deliver conditional exogeneity.
Below, we also discuss practical tests for this assumption.
Because of the finite numbers of lags involved in A.2, this is a finite-order con-
ditional exogeneity assumption. For convenience and because no confusion will arise
here, we simply refer to this as “conditional exogeneity.”
Assumption A.2 ensures that expected direct effects of Y 2,t−1 on Y1,t are identified.
As WL note, it suffices for A.2 that U t−1 ⊥ U1,t | (Y0 , Z t−1 , Xt ) and Y 2,t−1 ⊥ (Y0 , Z t−τ1 −1 ) |
(Y 1,t−1 , Xt ). Imposing U t−1 ⊥ U1,t | (Y0 , Z t−1 , Xt ) is the analog of requiring that serial
correlation is absent when lagged dependent variables are present. Imposing Y 2,t−1 ⊥
(Y0 , Z t−τ1 −1 ) | (Y 1,t−1 , Xt ) ensures that ignoring Y0 and omitting distant lags of Zt from
Xt doesn’t matter.
Our first result linking direct structural causality and G−causality shows that, given
A.1 and A.2 and with proper choice of Qt and S t , G−causality implies direct structural
causality.
Proposition 5.4 Let A.1 and A.2 hold. If Y 2,t−1 ⇏ᴰS Y1,t , t = 1, 2, . . ., then Y 2 does not
finite-order G−cause Y1 with respect to X, i.e.,
Y1,t ⊥ Y 2,t−1 | Y 1,t−1 , Xt , t = 1, . . . , T − τ2 .
Definition 5.5 Suppose A.1 holds and that for given τ1 ≥ m, τ2 ≥ 0 and for each y ∈
supp(Y1,t ) there exists a σ(Y 1,t−1 , Xt )−measurable version of the random variable
∫ 1{q1,t (Y t−1 , Z t , u1,t ) < y} dF1,t (u1,t | Y 1,t−1 , Xt ).
Then Y 2,t−1 ⇏ᴰS(Y 1,t−1 ,Xt ) Y1,t (direct non-causality−σ(Y 1,t−1 , Xt ) a.s.). If not, Y 2,t−1
⇒ᴰS(Y 1,t−1 ,Xt ) Y1,t .
For simplicity, we refer to this as direct non-causality a.s. The requirement that the
integral in this definition is σ(Y 1,t−1 , Xt )−measurable means that the integral does not
depend on Y 2,t−1 , despite its appearance inside the integral as an argument of q1,t . For
this, it suffices that Y 2,t−1 does not directly cause Y1,t ; but it is also possible that q1,t
and the conditional distribution of U1,t given Y 1,t−1 , Xt are in just the right relation to
hide the structural causality. Without the ability to manipulate this distribution, the
structural causality will not be detectable. One possible avenue to manipulating this
distribution is to modify the choice of Xt , as there are often multiple choices for Xt
that can satisfy A.2 (see White, H. and X. Lu, 2010b). For brevity and because hidden
structural causality is an exceptional circumstance, we leave aside further discussion of
this possibility here. The key fact to bear in mind is that the causal concept of Definition
5.5 distinguishes between those direct causal relations that are empirically detectable
and those that are not, for a given set of covariates Xt .
We now give a structural characterization of G−causality for structural VARs:
Theorem 5.6 Let A.1 and A.2 hold. Then Y 2,t−1 ⇏ᴰS(Y 1,t−1 ,Xt ) Y1,t , t = 1, . . . , T − τ2 , if
and only if
Y1,t ⊥ Y 2,t−1 | Y 1,t−1 , Xt , t = 1, . . . , T − τ2 ,
i.e., Y 2 does not finite-order G−cause Y1 with respect to X.
First, we specify the sense in which conditional exogeneity is necessary for the
equivalence of G−causality and direct structural causality.
Proposition 5.7 Given A.1, suppose that Y 2,t−1 ⇏ᴰS Y1,t , t = 1, 2, . . . . If A.2 does not
hold, then for each t there exists q1,t such that Y1,t ⊥ Y 2,t−1 | Y 1,t−1 , Xt does not hold.
That is, if conditional exogeneity does not hold, then there are always structures that
generate data exhibiting G−causality, despite the absence of direct structural causality.
Because q1,t is unknown, this worst case scenario can never be discounted. Further, as
WL show, the class of worst case structures includes precisely those usually assumed
in applications, namely separable structures (e.g., Y1,t = q1,t (Y 1,t−1 , Z t ) + U1,t ), as well
as the more general class of invertible structures. Thus, in the cases typically assumed
in the literature, the failure of conditional exogeneity guarantees G−causality in the
absence of structural causality. We state this formally as a corollary.
Corollary 5.8 Given A.1 with Y 2,t−1 ⇏ᴰS Y1,t , t = 1, 2, . . . , suppose that q1,t is invertible
in the sense that Y1,t = q1,t (Y 1,t−1 , Z t , U1,t ) implies the existence of ξ1,t such that U1,t =
ξ1,t (Y 1,t−1 , Z t , Y1,t ), t = 1, 2, . . . . If A.2 fails, then Y1,t ⊥ Y 2,t−1 | Y 1,t−1 , Xt fails, t = 1, 2, . . ..
Together with Theorem 5.6, this establishes that in the absence of direct causality and
for the class of invertible structures predominant in applications, conditional exogeneity
is necessary and sufficient for G non-causality.
Tests of conditional exogeneity for the general separable case follow from:
Proposition 5.9 Given A.1, suppose that E(Y1,t ) < ∞ and that
q1,t (Y t−1 , Z t , U1,t ) = ζt (Y 2,t−1 ) + υt (Y 1,t−1 , Z t , U1,t ),
where ζt and υt are unknown measurable functions. Let εt := Y1,t − E(Y1,t |Y t−1 , Xt ). If
A.2 holds, then
Y 2,t−1 ⊥ εt | Y 1,t−1 , Xt . (5)
Tests based on this result detect the failure of A.2, given separability. Such tests are
feasible because even though the regression error εt is unobserved, it can be consistently
estimated, say as ε̂t := Y1,t − Ê(Y1,t |Y t−1 , Xt ), where Ê(Y1,t |Y t−1 , Xt ) is a parametric or
nonparametric estimator of E(Y1,t |Y t−1 , Xt ). These estimated errors can then be used to
test (5). If we reject (5), then we must reject A.2. We discuss a practical procedure in
the next section. WL provide additional discussion.
WL also discuss dropping the separability assumption. For brevity, we maintain
separability here. Observe that under the null of direct non-causality, q1,t is necessarily
separable, as then ζt is the zero function.
If these rejection conditions do not hold, however, we cannot just decide to “accept”
(i.e., fail to reject) SN. As WL explain in detail, difficulties arise when CE and GN both
fail, as failing to reject SN here runs the risk of Type II error, whereas rejecting SN runs
the risk of Type I error. We resolve this dilemma by specifying the further rules:
In the latter case, we conclude only that CE and GN both fail, thereby obstructing struc-
tural inference. This sends a clear signal that the researcher needs to revisit the model
specification, with particular attention to specifying covariates sufficient to ensure con-
ditional exogeneity.
Because of the structure of this indirect test, it is not enough simply to consider its
level and power. We must also account for the possibility of making no decision. For
this, define
These are the analogs of the probabilities of Type I and Type II errors for the “no
decision” action. We would like these probabilities to be small. Next, we consider
These quantities correspond to notions of level and power, but with the sample space
restricted to the subset on which CE is true or GN is true, that is, the space where a
decision can be made. Thus, α∗ differs from the standard notion of level, but it does
capture the probability of taking an incorrect action when SN (the null) holds in the
restricted sample space, i.e., when CE and GN are both true. Similarly, π∗ captures the
probability of taking the correct action when SN does not hold in the restricted sample
space. We would like the “restricted level" α∗ to be small and the “restricted power" π∗
to be close to one.
WL provide useful bounds on the asymptotic properties (T → ∞) of the sample-size
T values of the probabilities defined above, pT , qT , α∗T , and π∗T :
Proposition 6.1 Suppose that for T = 1, 2, . . . the significance levels of the CE and GN
tests are α1T and α2T , respectively, and that α1T → α1 < .5 and α2T → α2 < .5. Suppose
the powers of the CE and GN tests are π1T and π2T , respectively, and that π1T → 1 and
π2T → 1. Then
When π1T → 1 and π2T → 1, one can also typically ensure α1 = 0 and α2 = 0 by suitable
choice of an increasing sequence of critical values. In this case, qT → 0, α∗T → 0, and
π∗T → 1. Because GN and CE tests will not be consistent against every possible alterna-
tive, weaker asymptotic bounds on the level and power of the indirect test hold for these
cases by Proposition 8.1 of WL. Thus, whenever possible, one should carefully design
GN and CE tests to have power against particularly important or plausible alternatives.
See WL for further discussion.
variables. In practice, researchers typically use parametric methods. These are conve-
nient, but they may lack power against important alternatives. To provide convenient
procedures for testing GN and CE with power against a wider range of alternatives,
WL propose augmenting standard tests with neural network terms, motivated by the
“QuickNet" procedures introduced by White, H (2006b) or the extreme learning ma-
chine (ELM) methods of Huang, G.B., Q.Y. Zhu, and C.K. Siew (2006). We now pro-
vide explicit practical methods for testing GN and CE for a leading class of structures
obeying A.1.
For simplicity, we let Y1,t be a scalar here. The extension to the case of vector Y1,t is
completely straightforward. Under the null of GN, i.e., Y1,t ⊥ Y 2,t−1 | Y 1,t−1 , Xt , we have
β0 = 0. The standard procedure therefore tests β0 = 0 in the regression equation
Y1,t = α0 + Y '1,t−1 ρ0 + Y '2,t−1 β0 + X't β1 + εt . (GN Test Regression 1)
If we reject β0 = 0, then we also reject GN. But if we don’t reject β0 = 0, care is needed,
as not all failures of GN will be indicated by β0 ≠ 0.
Observe that when CE holds and if GN Test Regression 1 is correctly specified, i.e.,
the conditional expectation E(Y1,t |Y t−1 , Xt ) is indeed linear in the conditioning vari-
ables, then β0 represents precisely the direct structural effect of Y 2,t−1 on Y1,t . Thus,
GN Test Regression 1 may not only permit a test of GN, but it may also provide a
consistent estimate of the direct structural effect of interest.
To mitigate specification error and gain power against a wider range of alterna-
tives, WL propose augmenting GN Test Regression 1 with neural network terms, as in
White’s (2006b, p. 476) QuickNet procedure. This involves testing β0 = 0 in
Y1,t = α0 + Y '1,t−1 ρ0 + Y '2,t−1 β0 + X't β1 + ∑_{j=1}^{r} ψ(Y '1,t−1 γ1, j + X't γ j )β j+1 + εt .
(GN Test Regression 2)
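A compact sketch of how such an augmented regression can be run follows. It uses one lag of each series, r random logistic hidden units standing in for the ψ terms, and an F-type statistic for β0 = 0; the DGP, the choice r = 5, and the plain OLS algebra are all illustrative assumptions, not WL's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def gn_stat(y1, y2, x, r=5):
    """F-type statistic for beta_0 = 0 in a QuickNet-style augmented regression:
    Y1,t on [1, Y1 lag, X_t, r random logistic hidden units of (Y1 lag, X_t)],
    with the Y2 lag added in the unrestricted fit."""
    target = y1[1:]
    y1lag, y2lag, xt = y1[:-1, None], y2[:-1, None], x[1:, None]
    core = np.column_stack([y1lag, xt])
    G = rng.standard_normal((core.shape[1] + 1, r))   # random gamma directions
    hidden = logistic(np.column_stack([np.ones(len(core)), core]) @ G)
    X_r = np.column_stack([np.ones(len(core)), core, hidden])  # beta_0 = 0 imposed
    X_u = np.column_stack([X_r, y2lag])
    rss_r, rss_u = rss(X_r, target), rss(X_u, target)
    dof = len(target) - X_u.shape[1]
    return (rss_r - rss_u) / (rss_u / dof)

# Hypothetical DGP: Y1 depends nonlinearly on its own lag, linearly on the Y2 lag.
T = 400
x = rng.standard_normal(T)
y2 = rng.standard_normal(T)
y1 = np.zeros(T)
for t in range(1, T):
    y1[t] = (np.cos(2.0 * y1[t - 1]) + 0.8 * y2[t - 1]
             + 0.3 * x[t] + 0.5 * rng.standard_normal())
print(gn_stat(y1, y2, x))   # large in this design: GN is rejected
```

The hidden units flexibly absorb the nonlinear dependence on the own lag and the covariate, so a significant β0 is harder to attribute to misspecification of those terms.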
Parallel to our comment above about estimating direct structural effects of interest,
we note that given A.1, A.2, and some further mild regularity conditions, such effects
can be identified and estimated from a neural network regression of the form
Observe that this regression includes Y 2,t−1 inside the hidden units. With r chosen suf-
ficiently large, this permits the regression to achieve a sufficiently close approximation
to E(Y1,t |Y t−1 , Xt ) and its derivatives (see Hornik, K., M. Stinchcombe, and H. White,
1990; Gallant, A.R. and H. White, 1992) that regression misspecification is not such an
issue. In this case, the derivative of the estimated regression with respect to Y 2,t−1 well
approximates
This quantity is the covariate conditioned expected marginal direct effect of Y 2,t−1 on
Y1,t .
Although it is possible to base a test for GN on these estimated effects, we do not
propose this here, as the required analysis is much more involved than that associated
with GN Test Regression 2.
Finally, to gain additional power WL propose tests using transformations of Y1,t ,
Y 1,t−1 , and Y 2,t−1 , as Y1,t ⊥ Y 2,t−1 | Y 1,t−1 , Xt implies f (Y1,t ) ⊥ g(Y 2,t−1 ) | Y 1,t−1 , Xt for
all measurable f and g. One then tests β1,0 = 0 in
ψ1 (Y1,t ) = α1,0 + ψ2 (Y 1,t−1 )' ρ1,0 + ψ3 (Y 2,t−1 )' β1,0 + X't β1,1
+ ∑_{j=1}^{r} ψ(Y '1,t−1 γ1,1, j + X't γ1, j )β1, j+1 + ηt . (GN Test Regression 3)
We take ψ1 and the elements of the vector ψ3 to be GCR, e.g., ridgelets or the logistic
cdf. The choices of γ, r, and ψ are as described above. Here, ψ2 can be the identity
(ψ2 (Y 1,t−1 ) = Y 1,t−1 ), its elements can coincide with ψ1 , or it can be a different GCR
function.
εt | Y 1,t−1 , Xt . If we reject this, then we also must reject CE. We describe the procedure
in detail below.
As WL discuss, such a procedure is not “watertight,” as this method may miss certain
alternatives to CE. But, as it turns out, there is no completely infallible method. By
offering the opportunity of falsification, this method provides crucial insurance against
being naively misled into inappropriate causal inferences. See WL for further discus-
sion.
The first step in constructing a practical test for CE is to compute estimates of εt ,
say ε̂t . This can be done in the obvious way by taking ε̂t to be the estimated residuals
from a suitable regression. For concreteness, suppose this is either GN Test Regression
1 or 2.
The next step is to use ε̂t to test Y 2,t−1 ⊥ εt | Y 1,t−1 , Xt . WL recommend doing this
by estimating the following analog of GN Test Regression 3:
ψ1 (ε̂t ) = α2,0 + ψ2 (Y 1,t−1 )' ρ2,0 + ψ3 (Y 2,t−1 )' β2,0 + X't β2,1
+ ∑_{j=1}^{r} ψ(Y '1,t−1 γ2,1, j + X't γ2, j )β2, j+1 + ηt . (CE Test Regression)
Note that the right-hand-side regressors are identical to those of GN Test Regression 3;
we just replace the dependent variable ψ1 (Y1,t ) for GN with ψ1 (ε̂t ) for CE. Nevertheless,
the transformations ψ1 , ψ2 , and ψ3 here may differ from those of GN Test Regression
3. To keep the notation simple, we leave these possible differences implicit. To test CE
using this regression, we test the null hypothesis β2,0 = 0: if we reject β2,0 = 0, then we
reject CE.
As WL explain, the fact that ε̂t is obtained from a “first-stage" estimation (GN) in-
volving potentially the same regressors as those appearing in the CE regression means
that choosing ψ1 (ε̂t ) = ε̂t can easily lead to a test with no power. For CE, WL thus rec-
ommend choosing ψ1 to be GCR. Alternatively, non-GCR choices may be informative,
such as
ψ1 (ε̂t ) = |ε̂t |, ψ1 (ε̂t ) = ε̂t (λ − 1{ε̂t < 0}), λ ∈ (0, 1), or ψ1 (ε̂t ) = ε̂2t .
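The two-stage CE procedure can be sketched end-to-end. This toy version uses a hypothetical DGP and coefficients, ψ1 (ε̂t ) = ε̂t² from the non-GCR choices above, identity transformations elsewhere, no hidden units, and simple homoskedasticity-style F algebra; here conditional exogeneity fails because the scale of the unobservable U1,t depends on the Y2 lag.

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_resid(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def rss(X, y):
    r = fit_resid(X, y)
    return float(r @ r)

def ce_stat(y1, y2, x):
    """Two-stage sketch: (1) first-stage regression of Y1,t on [1, Y1 lag,
    Y2 lag, X_t] gives residuals e_hat; (2) F-type statistic for the Y2-lag
    coefficient in a regression of psi1(e_hat) = e_hat**2 on the same terms."""
    target = y1[1:]
    y1lag, y2lag, xt = y1[:-1, None], y2[:-1, None], x[1:, None]
    X_full = np.column_stack([np.ones(len(target)), y1lag, y2lag, xt])
    e_hat = fit_resid(X_full, target)
    psi1 = e_hat ** 2
    X_r = np.column_stack([np.ones(len(target)), y1lag, xt])  # beta_{2,0} = 0 imposed
    rss_r, rss_u = rss(X_r, psi1), rss(X_full, psi1)
    dof = len(target) - X_full.shape[1]
    return (rss_r - rss_u) / (rss_u / dof)

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical DGP: the scale of U_{1,t} is driven by the Y2 lag, so A.2 fails.
T = 600
x = rng.standard_normal(T)
y2 = rng.standard_normal(T)
y1 = np.zeros(T)
for t in range(1, T):
    sigma = 0.4 + 0.6 * logistic(2.0 * y2[t - 1])   # CE violation via the error scale
    y1[t] = (0.3 * y1[t - 1] + 0.5 * y2[t - 1]
             + 0.2 * x[t] + sigma * rng.standard_normal())
print(ce_stat(y1, y2, x))   # large here: CE is rejected, as it should be
```

Note that choosing ψ1 (ε̂t ) = ε̂t in this sketch would have no power at all: the first stage makes ε̂t orthogonal to the Y2 lag by construction, which is exactly WL's reason for recommending GCR or other nonlinear ψ1.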
Significantly, the asymptotic sampling distributions needed to test β2,0 = 0 will gen-
erally be impacted by the first-stage estimation. Handling this properly is straightfor-
ward, but somewhat involved. To describe a practical method, we denote the first-stage
(GN) estimator as θ̂1,T := (α̂1,T , ρ̂1,T , β̂'1,0,T , β̂'1,1,T , . . . , β̂1,r+1,T )' , computed from GN Test
Regression 1 (r = 0) or 2 (r > 0). Let the second stage (CE) regression estimator be θ̂2,T ;
this contains the estimated coefficients for Y 2,t−1 , say β̂2,0,T , which carry the information
about CE. Under mild conditions, a central limit theorem ensures that
√T (θ̂T − θ0 ) →d N(0, C0 ),
where θ̂T := (θ̂'1,T , θ̂'2,T )', θ0 := plim(θ̂T ), convergence in distribution as T → ∞ is
denoted →d , and N(0, C0 ) denotes the multivariate normal distribution with mean zero
and covariance matrix C0 .
4. The regularity conditions include plausible memory and moment requirements, together with certain
smoothness and other technical conditions.
We then reject CE if TT > ĉT,n,1−α , where, with n chosen sufficiently large, ĉT,n,1−α is
the 1 − α percentile of the weighted bootstrap statistics
TT,i := T (θ̂T,i − θ̂T )' S'2 S2 (θ̂T,i − θ̂T ) = T (β̂2,0,T,i − β̂2,0,T )' (β̂2,0,T,i − β̂2,0,T ), i = 1, . . . , n.
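A minimal sketch of the weighted-bootstrap critical value follows. It is a simplified i.i.d. illustration: exponential(1) weights (positive, mean one, variance one), plain OLS re-solved under each weight draw, and the statistic reduced to the coefficient block under test; the serial-dependence refinements treated by Gonçalves and White (2004) and WL are omitted, and the regression itself is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def wols(X, y, w):
    """OLS re-solved with observation weights w (normal equations X'WX b = X'Wy)."""
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Hypothetical regression; the last coefficient plays the role of beta_{2,0} (zero under H0).
T = 400
X = np.column_stack([np.ones(T), rng.standard_normal((T, 2))])
y = X @ np.array([0.5, 1.0, 0.0]) + rng.standard_normal(T)

sel = [2]                                  # indices of the tested block (the "S2" selection)
b_hat = wols(X, y, np.ones(T))
TT = T * float(b_hat[sel] @ b_hat[sel])    # analog of the test statistic T_T

n = 500                                    # number of bootstrap draws
stats = []
for _ in range(n):
    w = rng.exponential(1.0, T)            # i.i.d. positive weights, mean 1, variance 1
    d = wols(X, y, w)[sel] - b_hat[sel]
    stats.append(T * float(d @ d))
c_hat = np.percentile(stats, 95)           # bootstrap 95th percentile critical value
print(TT, c_hat, TT > c_hat)
```

Comparing TT to the bootstrap percentile mimics the rejection rule above; with the tested coefficient truly zero, rejections should occur at roughly the nominal 5% rate across repeated samples.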
groups of neurons. WL also examine the structural content of classical Granger causal-
ity and a variety of related alternative versions that emerge naturally from different
versions of Assumption A.1.
Acknowledgments
We express our deep appreciation to Sir Clive W.J. Granger for his encouragement of
the research underlying the work presented here.
References
E. Candès. Ridgelets: Estimating with Ridge Functions. Annals of Statistics, 31:1561–
1599, 2003.
A. P. Dawid. Beware of the DAG! Proceedings of the NIPS 2008 Workshop on Causal-
ity, Journal of Machine Learning Research Workshop and Conference Proceedings,
6:59–86, 2010.
M. Eichler. Granger Causality and Path Diagrams for Multivariate Time Series. Journal
of Econometrics, 137:334–353, 2007.
A.R. Gallant and H. White. On Learning the Derivatives of an Unknown Mapping with
Multilayer Feedforward Networks. Neural Networks, 5:129–138, 1992.
R. Gibbons. Game Theory for Applied Economists. Princeton University Press, Prince-
ton, 1992.
S. Gonçalves and H. White. Maximum Likelihood and the Bootstrap for Nonlinear
Dynamic Models. Journal of Econometrics, 119:199–219, 2004.
G. B. Huang, Q.Y. Zhu, and C.K. Siew. Extreme Learning Machines: Theory and
Applications. Neurocomputing, 70:489–501, 2006.
H. White and K. Chalak. Settable Systems: An Extension of Pearl’s Causal Model with
Optimization, Equilibrium, and Learning. Journal of Machine Learning Research,
10:1759–1799, 2009.
H. White and X. Lu. Granger Causality and Dynamic Structural Systems. Journal of
Financial Econometrics, 8:193–243, 2010a.
H. White and X. Lu. Causal Diagrams for Treatment Effect Estimation with Application
to Selection of Efficient Covariates. Technical report, Department of Economics,
University of California, San Diego, 2010b.